Abstract: We consider n-by-n matrices whose (i,j)-th entry is $f(X_i^T X_j)$,
where $X_1, \dots, X_n$ are i.i.d. standard Gaussian random vectors in
$\mathbb{R}^p$, and $f$ is a real-valued function. The eigenvalue distribution
of these random kernel matrices is studied in the "large p, large n" regime.
It is shown that, when $p, n \to \infty$ with $p/n = \gamma$ a constant, and
$f$ is properly scaled so that $\mathrm{Var}(f(X_i^T X_j))$ is $O(p^{-1})$,
the spectral density converges weakly to a limiting density on $\mathbb{R}$.
The limiting density is dictated by a cubic equation involving its Stieltjes
transform. While for smooth kernel functions the limiting spectral density has
been previously shown to be the Marcenko-Pastur distribution, our analysis is
applicable to non-smooth kernel functions, resulting in a new family of
limiting densities.

1. Introduction

In recent years there has been significant progress in the development and
application of kernel methods in machine learning and statistical analysis of
high-dimensional data [13]. These methods include kernel PCA (Principal
Component Analysis), the "kernel trick" in SVM (Support Vector Machine), and
non-linear dimensionality reduction [5, 6], to name a few. In such kernel
methods, the input is a set of n high-dimensional data points
$X_1, \dots, X_n$ from which an n-by-n matrix is constructed, where its
(i,j)-th entry is a symmetric function of $X_i$ and $X_j$. Whenever the
function depends merely on the inner-product $X_i^T X_j$, it is called an
inner-product kernel matrix.

In this paper we study the spectral properties of an $n \times n$ symmetric
random kernel matrix $A$ whose construction is as follows. Let
$X_1, \dots, X_n$ be n i.i.d. Gaussian random vectors in $\mathbb{R}^p$, where
$X_i \sim \mathcal{N}(0, p^{-1} I_p)$ and $I_p$ is the $p \times p$ identity
matrix. That is, the np-many coordinates
$\{(X_i)_j, 1 \le i \le n, 1 \le j \le p\}$ are i.i.d. Gaussian random
variables with mean 0 and variance $p^{-1}$. The entries of $A$ are defined as

\[
A_{ij} = \begin{cases} f(X_i^T X_j; p), & i \neq j, \\ 0, & i = j, \end{cases}
\tag{1.1}
\]

where $f(\xi; p)$ is a real-valued function possibly depending on $p$. We will
later consider another model where the $X_i$ are drawn from the uniform
distribution over the unit sphere $S^{p-1}$ in $\mathbb{R}^p$.

The study of the spectrum of large random matrices, since Wigner's semi-circle
law, has been an active research area motivated by applications such as
quantum physics, signal processing, numerical linear algebra and statistical
inference, among others. An important result is the Marcenko-Pastur (M.P.) law
[12] for the spectrum of random matrices of the form $S = XX^T$ (also known
as Wishart matrices), where $X$ is a p-by-n (complex or real) matrix with
i.i.d. Gaussian entries. In the "large p, large n" limit, i.e. $p, n \to \infty$
and $p/n = \gamma$ ($0 < \gamma < \infty$), the spectral density of $S$
converges to a deterministic limit, known as the Marcenko-Pastur distribution,
which has $\gamma$ as its only parameter. We refer the reader to [2], [18] and
[4, Chapters 1-3] for an introduction to these topics. Notice that Wishart
matrices share their non-zero eigenvalues with the corresponding Gram matrices
$G = X^T X$, the latter of which, neglecting the difference at the diagonal
entries, can be considered as a kernel matrix as in Eqn. (1.1) with the linear
kernel function $f(\xi; p) = \xi$. Thus, the M.P. law and other results
involving Wishart matrices can be translated to the Gram matrix case.

The spectrum of inner-product random kernel matrices with kernel functions
that are locally smooth at the origin has been studied in [9]. It was shown
that, in the limit $p, n \to \infty$ and $p/n = \gamma$,

(1) whenever $f$ is locally $C^3$, the non-linear kernel matrix converges
asymptotically in spectral norm to a linear kernel matrix;

(2) with less regularity of $f$ (locally $C^2$), the weak convergence of the
spectral density is established.

We refer to [9] and references therein for more details, including a complete
review of the origins of this problem. The problem we study here is similar to
the one considered in [9], except that we allow the kernel function $f$ to
belong to a much larger class of functions; in particular, $f$ can be
discontinuous at the origin.

Our main result, Thm. 3.4, establishes the convergence of the spectral density
of random kernel matrices under the condition that the kernel function belongs
to a weighted $L^2$ space, is properly normalized and satisfies some
additional technical conditions. The limiting spectral density is
characterized by an algebraic equation, Eqn. (3.5), for its Stieltjes
transform. The equation involves only three parameters, namely $\nu$, $a$ and
$\gamma$. The parameter $\nu$ is the limit of
$p \cdot \mathrm{Var}(f(X_i^T X_j))$ and simply scales the limiting spectral
density. The parameter $a$ is the limiting coefficient of the linear term
$\xi$ in the expansion of $f(\xi; p)$ into rescaled Hermite polynomials, and
has a non-trivial effect on the shape of the limiting spectral density. The
result concerning the weak convergence of the spectral density in [9] can be
regarded as a special case of our result. Specifically, [9] proves that for a
locally smooth kernel function, the limiting spectral density is dictated by
its first-order Taylor expansion. The linear term in our rescaled Hermite
expansion asymptotically coincides with the first-order term of the Taylor
expansion. See also Remark 3.8 after Thm. 3.4.

Notice that the entries of the random kernel matrix are dependent. For
example, the triplet of entries (i,j), (j,k) and (k,i) are mutually dependent.
In the literature of random matrix theory (RMT), random matrices with dependent
entries have received some attention. For example, the spectral distribution of 
random matrices with "finite-range" dependency among entries is studied in 
[3]. However, we do not find studies of this sort to be readily applicable to the 
analysis of the random inner-product kernel matrices considered here. We em- 
phasize that our result only addresses the weak limit of the spectral density, 
while leaving many other questions about random kernel matrices unanswered. 
These include the analysis of the local statistics of the eigenvalues, the limit- 
ing distribution of the largest eigenvalue, and universality type questions with 
respect to different probability distributions for the data points. 

The rest of the paper is organized as follows: in Sec. 2 we review the definition 
and properties of the Stieltjes transform (Sec. 2.1), and revisit the proof of the 
M.P. law using the Stieltjes transform (Sec. 2.2). Sec. 3 includes the statement 
of our main theorem, Thm. 3.4, and the result of some numerical experiments. 
The proof of Thm. 3.4 is established in Sec. 4. Finally, the concluding remarks, 
discussion and open problems are provided in Sec. 5. 

Notations: For a vector $X$, we denote by $|X|$ its $\ell^2$ norm, i.e. for
$X = (X_1, \cdots, X_p)^T$ in $\mathbb{R}^p$,
$|X| = \sqrt{X_1^2 + \cdots + X_p^2}$. We write $x = O(1) p^{\alpha}$ to
indicate that $|x| \le C p^{\alpha}$ for some positive constant $C$ and large
enough $p$ (which also implies large enough $n$ since $p/n = \gamma$). Also,
$O_a(1)$ means that the constant $C$ depends on the quantity $a$, and the
latter is often independent of $p$. Throughout the paper, $\zeta$ stands for a
random variable following the standard normal distribution.



2. Review of the Stieltjes Transform and the M.P. Law 
2.1. The Stieltjes Transform 

For a probability measure $d\mu$ on $\mathbb{R}$, its Stieltjes transform (also
known as the Cauchy transform) is defined as (see, e.g. Appendix B of [4])

\[
m(z) = \int_{\mathbb{R}} \frac{1}{t - z}\, d\mu(t), \qquad \Im(z) > 0,
\]

and hence $\Im(m) > 0$. The probability density function can be recovered from
its Stieltjes transform via the "inversion formula"

\[
\lim_{b \to 0^+} \frac{1}{\pi} \Im\big(m(t + ib)\big) = \frac{d\mu}{dt}(t),
\tag{2.1}
\]

where the convergence is in the weak sense.

Point-wise convergence of the Stieltjes transform implies weak convergence
of the probability density (Thm. B.9 in [4]). This is the fundamental tool that
we use to establish the main result in our paper. For the n-by-n random kernel
matrix $A$, its empirical spectral density (ESD) is defined as

\[
\mathrm{ESD}_A = \frac{1}{n} \sum_{i=1}^{n} \delta_{\lambda_i(A)}(\lambda)\, d\lambda,
\tag{2.2}
\]

where $\{\lambda_i(A), i = 1, \cdots, n\}$ are the n (real) eigenvalues of $A$.
Considering $\mathrm{ESD}_A$ as a random probability measure on $\mathbb{R}$,
we have its Stieltjes transform as

\[
m_A(z) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\lambda_i(A) - z}
       = \frac{1}{n} \mathrm{Tr}(A - zI)^{-1}, \qquad \Im(z) > 0.
\tag{2.3}
\]

To show the convergence of $\mathrm{ESD}_A$ in expectation (or in a.s. sense),
it suffices to show that, for every fixed $z$ above the real axis, $m_A(z)$
converges to the Stieltjes transform of the limiting density in expectation
(or in a.s. sense).
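
As an illustration of Eq. (2.3) and the inversion formula (2.1), the following
is a minimal Python sketch, not taken from the paper. The test matrix, its
size, the seed and the point z are arbitrary choices made here, and the
function names are hypothetical. It evaluates $m_A(z)$ both from the
eigenvalues and from the resolvent trace, and evaluates the smoothed density
$\frac{1}{\pi}\Im\, m_A(t + ib)$ at a fixed $b > 0$.

```python
import numpy as np

def stieltjes_from_eigs(eigs, z):
    """m_A(z) = (1/n) * sum_i 1 / (lambda_i - z), first form in Eq. (2.3)."""
    return np.mean(1.0 / (eigs - z))

def stieltjes_from_resolvent(A, z):
    """m_A(z) = (1/n) * Tr (A - z I)^{-1}, second form in Eq. (2.3)."""
    n = A.shape[0]
    return np.trace(np.linalg.inv(A - z * np.eye(n))) / n

def smoothed_density(eigs, t, b):
    """(1/pi) * Im m_A(t + i b): the left side of Eq. (2.1) before b -> 0+."""
    return stieltjes_from_eigs(eigs, t + 1j * b).imag / np.pi

rng = np.random.default_rng(0)       # arbitrary seed, for reproducibility only
n = 200                              # illustrative size (assumption)
G = rng.standard_normal((n, n))
A = (G + G.T) / np.sqrt(2 * n)       # a generic real symmetric test matrix
eigs = np.linalg.eigvalsh(A)

z = 0.3 + 0.5j                       # any point with Im(z) > 0
m_eigs = stieltjes_from_eigs(eigs, z)
m_trace = stieltjes_from_resolvent(A, z)
density_at_zero = smoothed_density(eigs, 0.0, 0.05)
```

The two evaluations of $m_A(z)$ should agree up to round-off, and the
imaginary part is positive whenever $\Im(z) > 0$.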

Another convenience brought by fixed $z$ is the uniform boundedness of many
quantities. Specifically, for $z = u + iv$, $v > 0$,

\[
|m_A(z)| \le \frac{1}{n} \sum_{i=1}^{n} \frac{1}{|\lambda_i(A) - z|}
         \le \frac{1}{n} \sum_{i=1}^{n} \frac{1}{v} = \frac{1}{v}.
\]

Also,

\[
\big| \big((A - zI)^{-1}\big)_{ii} \big| \le \frac{1}{v}, \qquad 1 \le i \le n,
\tag{2.4}
\]

which follows from the spectral decomposition of $A$.



2.2. Proving the M.P. Law using the Stieltjes Transform 

Thm. 2.1 is the version of the M.P. law for random kernel matrices with a linear 
kernel function. The version for Wishart matrices is well known and its proof 
can be found in many places, see e.g. [4, Chapter 3.3]. 

Theorem 2.1 (the M.P. law for random linear-kernel matrices). Suppose that
$X_i \sim \mathcal{N}(0, p^{-1} I_p)$. Let $A$ be the random kernel matrix as
in Eq. (1.1) with the kernel function $f(\xi) = a\xi$ where $a$ is a constant.
Then the limiting spectral density of $A$ is

\[
p_1(t) = \frac{1}{a}\, p_{M.P.}\!\left( \frac{t}{a} + 1;\ \frac{1}{\gamma} \right).
\tag{2.5}
\]

The density function $p_{M.P.}(t; y)$, with positive constant $y$ as a
parameter, is defined as

\[
p_{M.P.}(t; y) = \left(1 - \frac{1}{y}\right)^{+} \delta(t)
  + \frac{\sqrt{(b(y) - t)^{+}\,(t - a(y))^{+}}}{2\pi y t},
\tag{2.6}
\]

where $(x)^{+} = \max\{x, 0\}$, $b(y) = (1 + \sqrt{y})^2$ and
$a(y) = (1 - \sqrt{y})^2$. The convergence of $\mathrm{ESD}_A$ to $p_1(t)dt$
is in the weak sense, almost surely.

Remark 2.2. In Eq. (2.5), the rescaling by $a$ is due to the constant $a$ in
front of the inner-product, and the shifting by $a$ is due to our setting
diagonal entries to be zero. Also, Eq. (2.6) is slightly different from the
M.P. distribution in the literature, since the random kernel matrices that we
consider are n-by-n and the variance of $X_i^T X_j$ is $p^{-1}$, while Wishart
matrices are p-by-p and have a different normalization.
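
The densities in Eqs. (2.5) and (2.6) can be evaluated directly. The sketch
below is an illustration only (function names, grids and parameter values are
assumptions made here, and $a > 0$ is assumed). It implements the continuous
part of $p_{M.P.}(t; y)$, leaving out the atom at 0 that appears when
$y > 1$, and checks numerically that its mass is $\min(1, 1/y)$.

```python
import numpy as np

def mp_density(t, y):
    """Continuous part of p_MP(t; y) in Eq. (2.6); the atom at 0 is excluded."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    lower = (1.0 - np.sqrt(y)) ** 2          # a(y)
    upper = (1.0 + np.sqrt(y)) ** 2          # b(y)
    out = np.zeros_like(t)
    inside = (t > lower) & (t < upper)
    ti = t[inside]
    out[inside] = np.sqrt((upper - ti) * (ti - lower)) / (2.0 * np.pi * y * ti)
    return out

def linear_kernel_density(t, a, gamma):
    """p_1(t) = (1/a) p_MP(t/a + 1; 1/gamma) as in Eq. (2.5), assuming a > 0."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    return mp_density(t / a + 1.0, 1.0 / gamma) / a

def midpoint_integral(func, lo, hi, num=200_000):
    """Plain midpoint rule, enough for a sanity check of the total mass."""
    h = (hi - lo) / num
    x = lo + h * (np.arange(num) + 0.5)
    return float(np.sum(func(x)) * h)

mass_y_half = midpoint_integral(lambda t: mp_density(t, 0.5), 0.0, 4.0)
mass_y_two = midpoint_integral(lambda t: mp_density(t, 2.0), 0.0, 7.0)
mass_p1 = midpoint_integral(lambda t: linear_kernel_density(t, 2.0, 2.0), -3.0, 5.0)
```

For $y = 2$ the continuous part carries mass $1/2$, and the remaining
$(1 - 1/y)^{+} = 1/2$ sits in the atom at $t = 0$.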

Remark 2.3. The distribution of the largest eigenvalue (i.e. the spectral norm,
denoted as $s(A)$), a question independent from the limiting spectral density,
is well-understood for Wishart matrices, and thus applies to Gram matrices. It
has been shown that the largest eigenvalue converges almost surely to its mean
value, following a stronger result about the limiting distribution of the
largest eigenvalue, namely the Tracy-Widom Law [10]. The Tracy-Widom Law of
the largest eigenvalue has been shown to be universal for certain sample
covariance matrices with non-Gaussian entries, see e.g. [17, 8]. As a result
(the smallest eigenvalue of a Wishart matrix is always non-negative), as
$p, n \to \infty$, $p/n = \gamma$, almost surely
$s(A) \le b(\gamma^{-1}) + 1$, which is an $O(1)$ constant depending on
$\gamma$ only.

Another way to characterize Eq. (2.5) is that $m_1(z)$, the Stieltjes transform
of $p_1(t)$, satisfies the following quadratic equation

\[
-\frac{1}{m(z)} = z + a\left(1 - \frac{1}{1 + \frac{a}{\gamma} m(z)}\right).
\tag{2.7}
\]

In the literature, Eq. (2.7) is sometimes called the M.P. equation. It has been
shown that Eq. (2.7) has a unique solution with positive imaginary part (Lemma
3.11 of [4]).

We reproduce the proof of Thm. 2.1 here, since some key techniques will be
used in proving our main result.

Proof of Thm. 2.1. In two steps it can be shown that $m_A(z)$, as defined in
Eq. (2.3), converges almost surely to the solution of Eq. (2.7). Without loss of
generality, let $a = 1$.

Step 1. Reduce a.s. convergence to convergence of $\mathbb{E} m_A(z)$.

Lemma 2.4 (concentration of $m_A$ at $\mathbb{E} m_A$). For the n-by-n random
kernel matrix $A$ as in Eq. (1.1), where the $X_i$'s are independent random
vectors, and a fixed complex number $z$ with $\Im(z) > 0$, we have that as
$n \to \infty$,

\[
m_A(z) - \mathbb{E} m_A(z) \to 0
\]

almost surely, and also

\[
\mathbb{E} |m_A - \mathbb{E} m_A| \le O(1)\, n^{-1/2}.
\tag{2.8}
\]



The above lemma relies on the facts that $\Im(z) > 0$ and that the $X_i$'s are
independent, while there is no restriction on the specific form of the kernel
function, nor on the distribution of $X_i$. The proof (left to Appendix B) uses
a martingale inequality, combined with the observation that among all the
entries of $A$ only the k-th column/row depends on $X_k$.

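Step 2. Convergence of $\mathbb{E} m_A(z)$. Observe that

\[
\begin{aligned}
\mathbb{E} m_A(z) &= \mathbb{E}\, \frac{1}{n} \mathrm{Tr}(A - zI)^{-1} \\
  &= \mathbb{E}\, \frac{1}{n} \sum_{i=1}^{n} \big((A - zI)^{-1}\big)_{ii} \\
  &= \mathbb{E} \big((A - zI)^{-1}\big)_{nn},
\end{aligned}
\]

where the last equality holds because the rows/columns of $A$ are exchangeable
and so are those of $(A - zI)^{-1}$. We then need the following formula

\[
\big((A - zI)^{-1}\big)_{nn}
  = \frac{1}{(A_{nn} - z) - A_{\cdot,n}^T (A^{(n)} - zI_{n-1})^{-1} A_{\cdot,n}},
\tag{2.9}
\]

where $A^{(n)}$ is the top left $(n-1) \times (n-1)$ minor of $A$, i.e. the
matrix $A$ is written in blocks as

\[
A = \begin{pmatrix} A^{(n)} & A_{\cdot,n} \\ A_{\cdot,n}^T & A_{nn} \end{pmatrix},
\]

and $I_{n-1}$ is the $(n-1) \times (n-1)$ identity matrix. Notice that since
$\Im(z) > 0$, both $A - zI$ and $A^{(n)} - zI_{n-1}$ are invertible. Formula
(2.9) can be verified by elementary linear algebra manipulation.

By Eq. (2.9) (recall that $A_{nn} = 0$ from Eq. (1.1)),

\[
\mathbb{E} m_A(z) = \mathbb{E} \big((A - zI)^{-1}\big)_{nn}
  = \mathbb{E}\, \frac{1}{-z - A_{\cdot,n}^T (A^{(n)} - zI_{n-1})^{-1} A_{\cdot,n}}.
\tag{2.10}
\]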
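
Formula (2.9) is the standard Schur-complement expression for a diagonal entry
of the resolvent. A quick numerical check can be sketched as follows; the
matrix size, seed and value of z are arbitrary choices made here for
illustration, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)           # arbitrary seed
n = 6                                    # small illustrative size
B = rng.standard_normal((n, n))
A = (B + B.T) / 2.0                      # real symmetric test matrix
np.fill_diagonal(A, 0.0)                 # zero diagonal, as in Eq. (1.1)
z = 0.2 + 0.7j                           # any z with Im(z) > 0

# Left side of (2.9): the (n, n) entry of the full resolvent.
resolvent_nn = np.linalg.inv(A - z * np.eye(n))[-1, -1]

# Right side of (2.9): Schur complement built from the top-left minor.
A_minor = A[:-1, :-1]                    # A^{(n)}
col = A[:-1, -1]                         # A_{.,n}
quad_form = col @ np.linalg.solve(A_minor - z * np.eye(n - 1), col)
formula_nn = 1.0 / ((A[-1, -1] - z) - quad_form)
```

The two quantities should coincide up to round-off, and both respect the
bound (2.4).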

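To proceed, we condition on the choice of $X_n$, and write

\[
X_i = \eta_i (X_n)_0 + \tilde{X}_i, \qquad 1 \le i \le n-1,
\tag{2.11}
\]

where $(X_n)_0 = X_n / |X_n|$ is the unit vector in the same direction as
$X_n$, and the $\tilde{X}_i$ lie in the $(p-1)$-dimensional subspace orthogonal
to $X_n$. Due to the orthogonal invariance of the standard multivariate
Gaussian distribution, we know that $\eta_i \sim \mathcal{N}(0, p^{-1})$,
$\tilde{X}_i \sim \mathcal{N}(0, p^{-1} I_{p-1})$, and they are independent.
Now we have

\[
X_i^T X_n = \eta_i |X_n|, \qquad 1 \le i \le n-1,
\tag{2.12}
\]

and

\[
X_i^T X_j = \eta_i \eta_j + \tilde{X}_i^T \tilde{X}_j, \qquad 1 \le i, j \le n-1,\ i \ne j.
\tag{2.13}
\]

Define $\eta = (\eta_1, \cdots, \eta_{n-1})^T$ and
$D_\eta = \mathrm{diag}\{\eta_1^2, \cdots, \eta_{n-1}^2\}$, which is a diagonal
matrix. Also, define

\[
\tilde{A}^{(n)}_{ij} = \begin{cases} \tilde{X}_i^T \tilde{X}_j, & i \ne j, \\ 0, & i = j, \end{cases}
\qquad 1 \le i, j \le n-1.
\tag{2.14}
\]

Then

\[
\begin{aligned}
A_{\cdot,n}^T (A^{(n)} - zI_{n-1})^{-1} A_{\cdot,n}
  &= |X_n|^2\, \eta^T \big(\eta\eta^T - D_\eta + \tilde{A}^{(n)} - zI_{n-1}\big)^{-1} \eta \\
  &= |X_n|^2\, \frac{\eta^T (\tilde{A}^{(n)} - D_\eta - zI_{n-1})^{-1} \eta}
                     {1 + \eta^T (\tilde{A}^{(n)} - D_\eta - zI_{n-1})^{-1} \eta} \\
  &= |X_n|^2 \left(1 - \frac{1}{1 + \eta^T (\tilde{A}^{(n)} - D_\eta - zI_{n-1})^{-1} \eta}\right),
\end{aligned}
\tag{2.15}
\]

where to get the 2nd line we use the Sherman-Morrison formula

\[
q^T (pq^T + M - zI)^{-1} = \frac{q^T (M - zI)^{-1}}{1 + q^T (M - zI)^{-1} p},
\qquad \forall p, q.
\]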
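
The rank-one step in Eq. (2.15) can be checked numerically. The sketch below
is illustrative only: the matrix M stands in for
$\tilde{A}^{(n)} - D_\eta$, and all sizes, seeds and names are assumptions
made here. It compares
$\eta^T(\eta\eta^T + M - zI)^{-1}\eta$ with
$1 - 1/(1 + \eta^T(M - zI)^{-1}\eta)$.

```python
import numpy as np

rng = np.random.default_rng(2)           # arbitrary seed
m = 5                                    # plays the role of n - 1
S = rng.standard_normal((m, m))
M = (S + S.T) / 2.0                      # real symmetric stand-in matrix
eta = rng.standard_normal(m)
z = -0.1 + 0.4j                          # any z with Im(z) > 0
identity = np.eye(m)

# Quadratic form with the rank-one term eta eta^T included.
lhs = eta @ np.linalg.solve(np.outer(eta, eta) + M - z * identity, eta)

# The same quantity after applying Sherman-Morrison with p = q = eta.
quad = eta @ np.linalg.solve(M - z * identity, eta)
rhs = 1.0 - 1.0 / (1.0 + quad)
```

Since $M$ is real symmetric and $\Im(z) > 0$, the quadratic form `quad` has
positive imaginary part, so the denominator $1 + $ `quad` cannot vanish.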



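By showing that the denominator in Eq. (2.15) is asymptotically concentrating
at the value of $1 + \gamma^{-1} \mathbb{E}\tilde{m}(z)$, where
$\tilde{m}(z) := \frac{1}{n} \mathrm{Tr}(\tilde{A}^{(n)} - zI_{n-1})^{-1}$, we
end up with

\[
\mathbb{E} \left| A_{\cdot,n}^T (A^{(n)} - zI_{n-1})^{-1} A_{\cdot,n}
  - \left(1 - \frac{1}{1 + \frac{1}{\gamma} \mathbb{E}\tilde{m}(z)}\right) \right| \to 0.
\]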



The detailed derivation is left to the Appendix (Lemma B.1). Notice that the
probability laws of $\eta_i$ and $\tilde{X}_i^T \tilde{X}_j$ do not depend on
the position of $X_n$, so we omit the conditioning on $X_n$ when computing the
probabilities and expectations. Furthermore, by Lemma B.6,

\[
\mathbb{E} |m_A(z) - \tilde{m}(z)| \to 0,
\tag{2.16}
\]

thus

\[
\mathbb{E}\tilde{m}(z) - \left(-z - \left(1 - \Big(1 + \frac{1}{\gamma} \mathbb{E}\tilde{m}(z)\Big)^{-1}\right)\right)^{-1} \to 0.
\tag{2.17}
\]

Since the quadratic Eq. (2.7) has a unique solution $m_1(z)$ with positive
imaginary part, Eq. (2.17) means that

\[
\mathbb{E}\tilde{m}(z) \to m_1(z).
\]

At last, by Eq. (2.16), $m_1(z)$ is the limit of $\mathbb{E} m_A(z)$. □



3. Random Inner-product Kernel Matrices 
3.1. Model and Notations 

Let $X_1, \cdots, X_n$ be i.i.d. random vectors in $\mathbb{R}^p$ and assume
that $X_i \sim \mathcal{N}(0, p^{-1} I_p)$. The random kernel matrix $A$ is
defined in Eqn. (1.1) with the kernel function $f(\xi; p)$, and we define

\[
k(x; p) = \sqrt{p}\, f\!\left(\frac{x}{\sqrt{p}};\, p\right).
\tag{3.1}
\]

In many cases of interest $f(\xi; p)$ does not depend on $p$, or the dependency is in
the form of some rescaling or normalization. However, we formulate our result 
in a general form, keeping the dependency of k(x;p) on p, and require k(x;p) to 
satisfy certain conditions. We will see that those conditions are often satisfied 
in the cases of interest (Remark 3.2 and Remark 3.3). 
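
To make the model concrete, here is a minimal Python sketch of how a matrix of
the form (1.1) could be assembled from a rescaled kernel $k$ via
$f(\xi; p) = p^{-1/2} k(\sqrt{p}\,\xi)$, which is Eq. (3.1) solved for $f$.
The function name, the sizes, the seed and the choice of the sign kernel are
assumptions made here for illustration; this is not the code used for the
experiments in the paper.

```python
import numpy as np

def kernel_matrix(X, k):
    """Build A with A_ij = p^{-1/2} k(sqrt(p) X_i^T X_j) for i != j, A_ii = 0.

    X is a p-by-n array whose columns are the data points X_1, ..., X_n.
    """
    p = X.shape[0]
    gram = X.T @ X                            # inner products X_i^T X_j
    A = k(np.sqrt(p) * gram) / np.sqrt(p)
    np.fill_diagonal(A, 0.0)                  # zero diagonal as in Eq. (1.1)
    return A

rng = np.random.default_rng(3)                # arbitrary seed
p, n = 50, 100                                # illustrative sizes, gamma = 0.5
X = rng.standard_normal((p, n)) / np.sqrt(p)  # columns ~ N(0, p^{-1} I_p)
A = kernel_matrix(X, np.sign)                 # sign kernel as an example
eigs = np.linalg.eigvalsh(A)
second_moment = np.mean(eigs ** 2)            # (1/n) Tr(A^2)
```

For the sign kernel every off-diagonal entry is $\pm p^{-1/2}$, so
$\frac{1}{n}\mathrm{Tr}(A^2) = (n-1)/p$ exactly, which gives a simple
consistency check on the construction.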

Let $X$ and $Y$ be two independent random vectors distributed as
$\mathcal{N}(0, p^{-1} I_p)$, and define $\xi_p = \sqrt{p}\, X^T Y$. Denote the
probability density of $\xi_p$ by $q_p(x)$, and the $L^2$ spaces
$\mathcal{H}_p = L^2(\mathbb{R}, q_p(x)dx)$. Let
$\{P_{l,p}(x), l = 0, 1, \cdots\}$ be a set of orthonormal polynomials in
$\mathcal{H}_p$, that is

\[
\int_{\mathbb{R}} P_{l_1,p}(x) P_{l_2,p}(x) q_p(x)\, dx = \delta_{l_1, l_2},
\]

where $\delta_{l,k}$ equals 1 when $l = k$ and 0 otherwise. We define
$P_{l,p}$ ($l \ge 0$) using the Gram-Schmidt procedure on the monomials
$\{1, x, x^2, \ldots\}$, so that $P_{0,p} = 1$, $P_{1,p} = x$ (notice that
$\mathbb{E}\xi_p^2 = 1$), and $P_{l,p}$ is a polynomial of degree $l$. Notice
that by the Central Limit Theorem, $\xi_p \to \mathcal{N}(0, 1)$ in
distribution as $p \to \infty$. We define
$\mathcal{H}_N = L^2(\mathbb{R}, q(x)dx)$ where
$q(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$. It can be shown (Lemma 4.1) that for
any finite degree $l$, the coefficients of the polynomial $P_{l,p}(x)$ converge
to those of the normalized $l$-degree Hermite polynomial, the latter being an
orthonormal basis of $\mathcal{H}_N$.
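
The Gram-Schmidt construction above depends only on the moments of the
underlying measure, so it can be sketched numerically. The Python code below
is an illustration under assumptions made here, not the paper's procedure: it
uses the fact that, conditionally on $X$, $\xi_p \sim \mathcal{N}(0, |X|^2)$
with $p|X|^2$ chi-square distributed with $p$ degrees of freedom, which gives
$\mathbb{E}\xi_p^{2k} = (2k-1)!! \prod_{j=0}^{k-1} (1 + 2j/p)$, and it
replaces explicit Gram-Schmidt by the equivalent Cholesky factorization of the
Hankel moment matrix. Function names and the value of p are hypothetical.

```python
import numpy as np

def gaussian_moment(m):
    """E zeta^m for zeta ~ N(0, 1): (m - 1)!! for even m, 0 for odd m."""
    if m % 2 == 1:
        return 0.0
    return float(np.prod(np.arange(1, m, 2)))    # empty product is 1 for m = 0

def xi_p_moment(m, p):
    """E xi_p^m, using xi_p | X ~ N(0, |X|^2) and p |X|^2 ~ chi-square(p)."""
    if m % 2 == 1:
        return 0.0
    k = m // 2
    norm_part = np.prod([(p + 2.0 * j) / p for j in range(k)])  # E |X|^{2k}
    return gaussian_moment(m) * float(norm_part)

def orthonormal_coeffs(moment, degree):
    """Row l holds the coefficients of the degree-l orthonormal polynomial
    in the monomial basis 1, x, ..., x^degree (Gram-Schmidt via Cholesky)."""
    size = degree + 1
    hankel = np.array([[moment(i + j) for j in range(size)] for i in range(size)])
    chol = np.linalg.cholesky(hankel)             # hankel = chol @ chol.T
    return np.linalg.inv(chol)

p = 1000                                          # illustrative dimension
P = orthonormal_coeffs(lambda m: xi_p_moment(m, p), 3)   # P_{l,p}, l <= 3
H = orthonormal_coeffs(gaussian_moment, 3)               # normalized Hermite
max_gap = np.max(np.abs(P - H))
```

In this example the coefficients of $P_{l,p}$ differ from those of the
normalized Hermite polynomials by a quantity of order $p^{-1}$; for instance
$P_{2,p}(x) = (x^2 - 1)/\sqrt{2 + 6/p}$.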

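We formally expand $k(x; p)$ as

\[
\begin{aligned}
k(x; p) &= \sum_{l=0}^{\infty} a_{l,p} P_{l,p}(x), \\
a_{l,p} &= \int_{\mathbb{R}} k(x; p) P_{l,p}(x) q_p(x)\, dx,
\end{aligned}
\tag{3.2}
\]

and will later explain how to understand this formal expansion. Corresponding
to the $l$-th term in Eqn. (3.2), we define the random kernel matrix $A_l$ to be

\[
(A_l)_{ij} = \begin{cases} f_l(X_i^T X_j; p), & i \ne j, \\ 0, & i = j, \end{cases}
\tag{3.3}
\]

where $f_l(\xi; p) = \frac{1}{\sqrt{p}} P_{l,p}(\sqrt{p}\,\xi)$.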



3.2. Statement of the Main Theorem 



Our main result is stated in Thm. 3.4, which establishes the weak convergence of 
the spectrum of random inner-product kernel matrices. The following conditions 
are required for k(x;p): 

1. (C.Variance) For all $p$, $k(x; p) \in \mathcal{H}_p$, and as
$p \to \infty$, $\mathrm{Var}(k(\xi_p; p)) = \nu_p \to \nu$ which is a finite
non-negative number. We also assume that
$a_{0,p} = \mathbb{E} k(\xi_p; p) = 0$ (Remark 3.5).

2. (C.p-Uniform) The expansion in Eqn. (3.2) converges in $\mathcal{H}_p$
uniformly in $p$. Equivalently, let

\[
k_L(x; p) = \sum_{l=0}^{L} a_{l,p} P_{l,p}(x),
\]

then for any $\epsilon > 0$, there exist $L$ and $p_0$ such that
$\sum_{l=L+1}^{\infty} a_{l,p}^2 < \epsilon$ for $p > p_0$.

3. (C.a_1) As $p \to \infty$, $a_{1,p} \to a$ which is a constant.

Remark 3.1. By condition (C.Variance), the integrals in Eqn. (3.2) are
well-defined. The requirement $\nu_p \to \nu$ can be fulfilled as long as
$k(x; p) \in \mathcal{H}_p$ and is properly scaled. Notice that
$\nu_p = \mathrm{Var}(k(\xi_p; p)) = \sum_{l=1}^{\infty} a_{l,p}^2$, thus in
condition (C.a_1), $a^2 \le \nu$.

Remark 3.2. When $k(x; p) = k(x)$, and if (1) $k(x) \in \mathcal{H}_N$, and
$\mathbb{E} k(\zeta) = 0$ where $\zeta \sim \mathcal{N}(0, 1)$, and (2) $k(x)$
satisfies

\[
\int_{\mathbb{R}} k(x)^2\, |q_p(x) - q(x)|\, dx \to 0, \qquad p \to \infty,
\tag{3.4}
\]

then the three conditions are satisfied and
$\nu_p \to \nu_N := \mathbb{E} k(\zeta)^2$, and
$a_{1,p} \to a_N := \mathbb{E}\, \zeta k(\zeta)$ (Lemma C.2). Eqn. (3.4) holds as
long as the singularity in the integral, say at $x = \infty$ or
$k(x) = \infty$, can be controlled p-uniformly. This is the case, for example,
when $k(x)$ is bounded, or when $k(x)$ is bounded on $|x| \le R$ for any
$R > 0$ and $k(x)^2$ is p-uniformly integrable at $x \to \infty$ (Lemma C.5). It
is also possible for $k(x)$ to be unbounded. See Sec. 3.3 for an example of
$k(x)$ that diverges at $x = 0$.

Remark 3.3. When $f(\xi; p) = f(\xi)$, the three conditions generally need to
be checked for $k(x; p)$ case by case. For the special situation where
$f(\xi)$ is $C^1$ at $\xi = 0$, see Remark 3.8.

Theorem 3.4 (the limiting spectrum of random inner-product kernel matrices).
Suppose that $X_1, \cdots, X_n \sim \mathcal{N}(0, p^{-1} I_p)$ are i.i.d., and
$k(x; p)$ satisfies conditions (C.Variance), (C.p-Uniform) and (C.a_1). Then,
as $p, n \to \infty$ with $p/n = \gamma$, $\mathrm{ESD}_A$ (the empirical
spectral density of the random kernel matrix $A$, defined in Eqn. (2.2))
converges weakly to a continuous probability measure on $\mathbb{R}$ in the
almost sure sense. The Stieltjes transform of the limiting spectral density is
the solution of the following algebraic equation

\[
-\frac{1}{m(z)} = z + a\left(1 - \frac{1}{1 + \frac{a}{\gamma} m(z)}\right)
  + \frac{\nu - a^2}{\gamma}\, m(z),
\tag{3.5}
\]

which is at most cubic, and involves three parameters: $\nu$ (defined in
(C.Variance)), $a$ (defined in (C.a_1)) and $\gamma$. Eqn. (3.5) has a unique
solution $m(z)$ with positive imaginary part (Lemma A.1), and the explicit
formula of

\[
y(u) := \lim_{v \to 0^+} \Im\big(m(u + iv)\big)
\tag{3.6}
\]

is given in Appendix A.
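
Since the explicit formula of Appendix A is not reproduced here, one hedged
alternative for evaluating the limiting density is to solve Eqn. (3.5)
numerically. Multiplying (3.5) by $m(1 + bm)$ with $b = a/\gamma$ and
$c = (\nu - a^2)/\gamma$ gives the polynomial
$cb\,m^3 + (b(z + a) + c)\,m^2 + (z + b)\,m + 1 = 0$. The Python sketch below
(function names, the value of the small imaginary part v and the integration
grid are assumptions made here) picks the root with the largest imaginary part,
relying on the uniqueness statement of the theorem, and applies the inversion
formula (2.1).

```python
import numpy as np

def stieltjes_limit(z, a, nu, gamma):
    """Root of the polynomial form of Eq. (3.5) with positive imaginary part."""
    b = a / gamma
    c = (nu - a ** 2) / gamma
    # c*b*m^3 + (b*(z + a) + c)*m^2 + (z + b)*m + 1 = 0; np.roots drops
    # leading zero coefficients, so the quadratic special cases also work.
    roots = np.roots([c * b, b * (z + a) + c, z + b, 1.0])
    return roots[np.argmax(roots.imag)]

def limiting_density(u, a, nu, gamma, v=1e-6):
    """(1/pi) Im m(u + i v) for small v > 0, cf. Eqs. (2.1) and (3.6)."""
    return stieltjes_limit(u + 1j * v, a, nu, gamma).imag / np.pi

# Semi-circle case (a = 0) and M.P. case (nu = a^2) as sanity checks.
semicircle_at_0 = limiting_density(0.0, a=0.0, nu=1.0, gamma=1.0)
mp_at_0 = limiting_density(0.0, a=1.0, nu=1.0, gamma=1.0)

# Sign kernel of Sec. 3.3.1: a = sqrt(2/pi), nu = 1, here with gamma = 2.
grid = np.arange(-6.0, 8.0, 0.01)
sign_density = np.array(
    [limiting_density(u, np.sqrt(2 / np.pi), 1.0, 2.0, v=1e-3) for u in grid]
)
sign_mass = float(np.sum(sign_density) * 0.01)
```

With $a = 0$ the equation reduces to that of a semi-circle with variance
$\nu/\gamma$, and with $\nu = a^2$ it reduces to Eq. (2.7); the two sanity
values above correspond to $1/\pi$ and $\sqrt{3}/(2\pi)$ respectively.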
Remark 3.5. We assume $a_{0,p} = 0$, since otherwise it results in adding to
the kernel matrix a perturbation
$\frac{1}{\sqrt{p}} a_{0,p} (1_n 1_n^T - I_n)$, where $1_n$ is the all-ones
vector of length $n$ and $I_n$ is the identity matrix. The limiting spectral
density of a sequence of Hermitian matrices with growing size ($n \to \infty$)
is invariant to a finite-rank perturbation (with rank that does not depend on
$n$), see Thm. A.43 in [4].

Remark 3.6. Recall the definition of $A_1$ in Eqn. (3.3). The limiting spectral
density of $A_1$ is the M.P. distribution. For this case, $f(\xi; p) = a\xi$,
or equivalently $k(x; p) = ax$, for some constant $a$. Then, the expansion in
Eqn. (3.2) has one term, $a_{1,p} = a$, $\nu_p = a^2$, and Eqn. (3.5) is
reduced to Eqn. (2.7).

Remark 3.7. The limiting spectral density of $A_l$ ($l \ge 2$) is a
semi-circle. Moreover, the limiting density of any partial sum (finite or
infinite) of $A_2, A_3, \cdots$ is a semi-circle, whose squared radius equals
the sum of the squared radii of the semi-circle of each $A_l$.

Remark 3.8. For random kernel matrices with locally smooth kernel functions,
the limiting spectral density is the M.P. distribution. Specifically, if
$f(\xi; p) = f(\xi)$, and is locally $C^1$ at $\xi = 0$, one can show (Lemma
C.3) that the result in the theorem holds and $a^2 = \nu = (f'(0))^2$. In other
words, the linear term in Eqn. (3.2) determines the limiting spectral density,
in agreement with the result in [9].

The proof of Thm. 3.4 is given in Section 4. Before presenting the proof, we
analyze some examples of kernel functions numerically.

3.3. Numerical Experiments

We compare the eigenvalue histogram and the theoretical limiting spectral
density numerically. In the subsequent figures, the eigenvalues that produce
the empirical histogram are computed by MATLAB's eig function and correspond
to a single realization of the random kernel matrix. The "theoretical curve" is
calculated using the "inversion formula" Eqn. (2.1) and Eqn. (A.2), which is the
expression for $y(u; a, \nu, \gamma)$ defined in Eqn. (3.6).

3.3.1. Example: Sign(x)

As an example of a discontinuous kernel function, let

\[
k(x; p) = k(x) = \mathrm{Sign}(x),
\]

where $\mathrm{Sign}(x)$ is 1 when $x > 0$ and -1 otherwise. Since
$|k(x)| = 1$, $k(x)$ is bounded, and according to Remark 3.2, by Lemma C.2 and
Lemma C.5, $k(x)$ satisfies conditions (C.Variance), (C.p-Uniform) and
(C.a_1). Meanwhile, $a = \mathbb{E}|\zeta| = \sqrt{2/\pi}$, and $\nu_p = 1$ for
all $p$, thus $\nu = 1$.

Fig. 1 is for $X_i \sim \mathcal{N}(0, p^{-1} I_p)$. Notice that for the sign
kernel, the two models $X_i \sim \mathcal{N}(0, p^{-1} I_p)$ and
$X_i \sim U(S^{p-1})$ result in the same probability law of the random kernel
matrix. This is due to the fact that
$\mathrm{Sign}(X_i^T X_j) = \mathrm{Sign}((X_i/|X_i|)^T (X_j/|X_j|))$ and that
if $X_i \sim \mathcal{N}(0, p^{-1} I_p)$ then $X_i/|X_i| \sim U(S^{p-1})$. As
such, the results for $X_i \sim U(S^{p-1})$ are omitted.

Fig 1. Random kernel matrix with the Sign kernel, and
$X_i \sim \mathcal{N}(0, p^{-1} I_p)$. (Left) $p = 4 \times 10^2$,
$n = 4 \times 10^3$, $\gamma = p/n = 0.1$. (Right) $p = 8 \times 10^3$,
$n = 4 \times 10^3$, $\gamma = p/n = 2$. The blue-boundary bars are the
empirical eigenvalue histograms, and the red broken-line curves are the
theoretical prediction of the eigenvalue densities by Thm. 3.4.

The following serves as a motivation for the sign kernel matrix. Consider a
network of n "subjects" represented by $X_1, \dots, X_n$ lying in
$\mathbb{R}^p$. Subjects i and j have a friendship relationship if they are
positively correlated, i.e., if $X_i^T X_j > 0$, and a non-friendship
relationship if $X_i^T X_j < 0$. The off-diagonal entries of the n-by-n kernel
matrix $A$ are all $\pm 1$, representing the friendship/non-friendship
relationships. This model has the merit that if i and j are friends, and j and
k are also friends, then chances are greater that i and k are also friends.
When the $X_i$'s are i.i.d. uniformly distributed on the unit sphere in
$\mathbb{R}^p$ and $p$ is fixed, according to [11], as $n$ grows to infinity
the top $p$ eigenvectors of the kernel matrix $A$ converge, up to a multiplying
constant and a global rotation, to the coordinates of the n data points. In
this case, the eigen-decomposition of the sign kernel matrix recovers the
positioning of the subjects in the whole community from their pairwise
relationships. On the other hand, Thm. 3.4 covers the more realistic case of
the "large p, large n" regime.

3.3.2. Example: $|x|^{-r}$ ($r < 1/2$)

As examples of unbounded kernel functions, we study the even function

\[
k_e(x) = |x|^{-r} - \mathbb{E}|\zeta|^{-r}
\]

and the odd function

\[
k_o(x) = \mathrm{Sign}(x)\, |x|^{-r},
\]

where $r < 1/2$ so as to guarantee the integrability of $k(x)^2$ at $x = 0$.

Notice that for both cases, $|k(x)|$ is bounded on $\{|x| > R\}$ for any
$R > 0$, and diverges at $x = 0$. Meanwhile, $k(x)^2$, which behaves like
$|x|^{-2r}$ near the origin, is integrable at $x = 0$, and with the fact that
$q_p(x) \le q_p(0) \to q(0) = 1/\sqrt{2\pi}$, Eqn. (3.4) still holds. Thus by
Lemma C.2, Thm. 3.4 applies to both $k_e$ and $k_o$. By

\[
\mathbb{E}|\zeta|^{-r} = \frac{2^{-r/2}}{\sqrt{\pi}}\, \Gamma\!\left(\frac{1 - r}{2}\right),
\]

where $\Gamma(\cdot)$ is the Gamma function, and similarly for
$\mathbb{E}|\zeta|^{-2r}$, the constants $\nu$ and $a$ for both $k_e$ and $k_o$
can be explicitly computed. For $k_e$, $\nu = \mathrm{Var}(|\zeta|^{-r})$ and
$a = 0$. For $k_o$, $\nu = \mathbb{E}|\zeta|^{-2r}$, and

\[
a = \mathbb{E}|\zeta|^{1-r} = \sqrt{\frac{2^{1-r}}{\pi}}\; \Gamma\!\left(1 - \frac{r}{2}\right).
\]

Fig 2. Random kernel matrix where
$k(x) = k_e(x) = |x|^{-1/4} - \mathbb{E}|\zeta|^{-1/4}$ (left) and
$k_o(x) = \mathrm{Sign}(x)|x|^{-1/4}$ (right), with
$f(\xi) = p^{-1/2} k(p^{1/2} \xi)$,
$X_i \sim \mathcal{N}(0, p^{-1} I_p)$, and $p = 2 \times 10^3$,
$n = 4 \times 10^3$, $\gamma = p/n = 0.5$.

The numerical results for $r = 1/4$ with
$X_i \sim \mathcal{N}(0, p^{-1} I_p)$ are shown in Fig. 2. The empirical
histograms for $X_i \sim U(S^{p-1})$ look almost identical and are therefore
omitted. In the left panel of Fig. 2, the empirical spectral density is close
to a semi-circle, as our theory predicts. Notice that for $r = 1/4$, the
off-diagonal entries of the random kernel matrix do not have a 4th moment.
However, this does not contradict the "Four Moment Theorem" for random matrices
with i.i.d. entries [20], since the model for random kernel matrices is
different.
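
The constants $\nu$ and $a$ of this example follow from the absolute moments
of a standard normal variable,
$\mathbb{E}|\zeta|^s = 2^{s/2}\, \Gamma((s+1)/2)/\sqrt{\pi}$ for $s > -1$. A
small Python sketch of the computation for $r = 1/4$ is given below; the
function and variable names are hypothetical and introduced here only for
illustration.

```python
from math import gamma, pi, sqrt

def abs_moment(s):
    """E|zeta|^s for zeta ~ N(0, 1); finite only for s > -1."""
    if s <= -1:
        raise ValueError("E|zeta|^s is infinite for s <= -1")
    return 2.0 ** (s / 2.0) * gamma((s + 1.0) / 2.0) / sqrt(pi)

r = 0.25                                  # exponent used in Fig. 2

# Even kernel k_e(x) = |x|^{-r} - E|zeta|^{-r}: centered, no linear component.
nu_even = abs_moment(-2 * r) - abs_moment(-r) ** 2   # Var(|zeta|^{-r})
a_even = 0.0

# Odd kernel k_o(x) = Sign(x) |x|^{-r}.
nu_odd = abs_moment(-2 * r)               # E|zeta|^{-2r}
a_odd = abs_moment(1 - r)                 # E zeta k_o(zeta) = E|zeta|^{1-r}
```

The requirement $r < 1/2$ is visible here as the condition $-2r > -1$ needed
for $\mathbb{E}|\zeta|^{-2r}$ to be finite, and $a^2 \le \nu$ holds as stated
in Remark 3.1.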

4. Proof of the Main Theorem 

The model and the notations are the same as in Sec. 3.1. The proof of Thm. 
3.4 is provided in Sec. 4.3. Prior to the proof, in Sec. 4.1 we review some useful 
properties of Hermite polynomials, and in Sec. 4.2 we introduce an asymptotic 
upper bound for the expected value of the spectral norm of random kernel 
matrices. The other model $X_i \sim U(S^{p-1})$ is analyzed in Sec. 4.4, where it is
shown that the result of Thm. 3.4 still holds. 
4.1. Orthonormal Polynomials

4.1.1. $\mathcal{H}_N$ and normalized Hermite polynomials

Define the normalized Hermite polynomials as

\[
h_l(x) = \frac{1}{\sqrt{l!}} H_l(x), \qquad l = 0, 1, \cdots
\tag{4.1}
\]

where $H_l(x)$ is the $l$-degree Hermite polynomial, satisfying

\[
\int_{\mathbb{R}} H_{l_1}(x) H_{l_2}(x) q(x)\, dx = \delta_{l_1, l_2} \cdot l_1!.
\]

Thus, $\{h_l(x), l = 0, 1, \cdots\}$ form an orthonormal basis of
$\mathcal{H}_N$. The explicit formula of $H_l$ is [1]

\[
H_l(x) = l! \sum_{k=0}^{\lfloor l/2 \rfloor} \frac{(-1)^k}{2^k\, k!\, (l - 2k)!}\, x^{l - 2k}.
\]

Also, the derivative of $H_l(x)$ satisfies the recurrence relation
$H_l'(x) = l H_{l-1}(x)$ for $l \ge 1$, and as a result,

\[
h_l'(x) = \sqrt{l}\, h_{l-1}(x).
\tag{4.2}
\]

4.1.2. $\mathcal{H}_p$ and $P_{l,p}(x)$

Recall that the random variable $\xi_p$ converges in distribution to
$\mathcal{N}(0, 1)$ as $p \to \infty$. Meanwhile, the moments of $\xi_p$
approximate those of $\mathcal{N}(0, 1)$:

\[
\mathbb{E}\xi_p^k = \begin{cases} (k-1)!! + O_k(1)\, p^{-1}, & k \text{ even;} \\ 0, & k \text{ odd.} \end{cases}
\tag{4.3}
\]

Eq. (4.3) is verified by directly computing the moments of $\xi_p$ using the
model, i.e. $\xi_p = \sqrt{p}\, X^T Y$ where $X$ and $Y$ are independently
distributed as $\mathcal{N}(0, p^{-1} I_p)$. With the following lemma, Eq. (4.3)
implies the asymptotic consistency between $P_{l,p}$ and $h_l$.

Lemma 4.1 (convergence of $P_{l,p}$ to $h_l$). Let
$\{P_{l,p}, l = 0, 1, \cdots\}$ be the orthonormal polynomials of
$L^2(\mathbb{R}, d\mu_p)$, where $\mu_p$ is a sequence of probability measures.
Suppose that the moments of $\mu_p$ approximate those of $\mathcal{N}(0, 1)$
in the sense that, for every fixed $k$,

\[
\int_{\mathbb{R}} x^k\, d\mu_p(x) = \mathbb{E}\zeta^k + O_k(1)\, p^{-1}.
\]

Then, for every fixed degree $l$,

\[
P_{l,p}(x) = h_l(x) + \sum_{j=0}^{l} (\delta_{l,p})_j\, x^j,
\]

where $(\delta_{l,p})_j$ satisfy

\[
\max_{0 \le j \le l} |(\delta_{l,p})_j| \le O_l(1)\, p^{-1}.
\]

The proof of Lemma 4.1 follows from the fact that the coefficients of the
$l$-degree orthogonal polynomials are decided by up to the first $2l$ moments.
One consequence of Lemma 4.1 is that as $p \to \infty$

\[
|P_{l,p}(x)| \le O_l(1)\, M^l, \qquad |x| \le M,
\tag{4.4}
\]

as the coefficients of $P_{l,p}(x)$ for each $l$ converge to those of
$h_l(x)$. Also, Eq. (4.2) leads to

\[
\begin{aligned}
P_{l,p}'(x) &= \sqrt{l}\, P_{l-1,p}(x) + O_l(1)\, M^{l-1} p^{-1}, \\
P_{l,p}''(x) &= \sqrt{l(l-1)}\, P_{l-2,p}(x) + O_l(1)\, M^{l-2} p^{-1}.
\end{aligned}
\]

Another consequence is the "asymptotic consistency between the
$P_{l,p}$-expansion and the Hermite-expansion" in their first finitely many
terms (Lemma C.1). This further implies that conditions (C.Variance),
(C.p-Uniform) and (C.a_1) are satisfied by a large class of kernel functions
(Remark 3.2).



4.2. Spectral Norm Bound

The following lemma gives an upper bound for the expectation of the spectral
norm of random kernel matrices whose rescaled kernel function $k(x; p)$ is
$P_{l,p}(x)$ (defined in Sec. 4.1) for some $l$. The method is to analyze the
4th moment of the random matrix.

Lemma 4.2 (bounding mean spectral norm by the 4th moment). Let $A$ be the
random kernel matrix defined in Eq. (1.1) with the kernel function
$f(\xi; p) = p^{-1/2} P_{l,p}(\sqrt{p}\,\xi)$, $l \ge 1$, where $P_{l,p}$ is
defined as in Sec. 4.1. The $X_i$'s are i.i.d. distributed as
$\mathcal{N}(0, p^{-1} I_p)$. Then, as $p, n \to \infty$, $p/n = \gamma$,

\[
\mathbb{E}\, s(A) \le O_{l,\gamma}(1)\, n^{1/4}.
\]

Remark 4.3. We are aware of the existence of significant literature on the
spectral norm of random matrices. The asymptotic concentration of the largest
eigenvalue at its mean value is quantified by the Tracy-Widom Law for Gaussian
ensembles (see, e.g. [2, Chapter 3]) and a large class of Wigner-type matrices
(see [16], [19] and references therein), as well as Wishart-type matrices
(Remark 2.3). For random kernel matrices, $s(A_l)$ is conjectured to be
$O(1)$; see more in Sec. 5. However, the bound provided by Lemma 4.2, though
not tight, is sufficient for the proof of our main theorem.

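Proof of Lemma 4.2. Let $\{\lambda_i, 1 \le i \le n\}$ be the eigenvalues of
$A$. Since

\[
s(A)^4 \le \sum_{i=1}^{n} \lambda_i^4 = \mathrm{Tr}(A^4)
  = \sum_{i,j,k,l} A_{ij} A_{jk} A_{kl} A_{li},
\]

we have

\[
\mathbb{E}\, s(A) \le \big(\mathbb{E}\, s(A)^4\big)^{1/4}
  \le \Big( \sum_{i,j,k,l} \mathbb{E} A_{ij} A_{jk} A_{kl} A_{li} \Big)^{1/4}.
\tag{4.6}
\]

We observe that for $\mathbb{E} A_{ij} A_{jk} A_{kl} A_{li}$ to be non-zero, in
$\{i, j, k, l\}$ neighboring indices must differ since $A_{ii} = 0$. Then we
have the following cases:

1. $i = k$, $j = l$:
$\mathbb{E} A_{ij} A_{jk} A_{kl} A_{li} = \mathbb{E} A_{ij}^4
= p^{-2}\, \mathbb{E} P_{l,p}(\sqrt{p}\,\xi_{12})^4 = O_l(1)\, p^{-2}$, where
the last equality is due to Eqn. (4.3) and Lemma 4.1.

2. $i = k$, $j \ne l$ or $i \ne k$, $j = l$:
$\mathbb{E} A_{ij} A_{jk} A_{kl} A_{li}
= p^{-2} \big(\mathbb{E} P_{l,p}(\sqrt{p}\,\xi_{12})^2\big)^2
= (1 + O_l(1) p^{-1})^2 p^{-2} = O_l(1)\, p^{-2}$.

3. $i \ne k$, $l \ne j$: when the degree $l = 1$,
$\mathbb{E} A_{ij} A_{jk} A_{kl} A_{li} = p^{-3}$. When the degree $l \ge 2$,
we have the following estimate (Lemma D.1)

\[
\mathbb{E} A_{ij} A_{jk} A_{kl} A_{li} = O_l(1)\, p^{-4}.
\]

Thus, when the degree $l = 1$,

\[
\begin{aligned}
\sum_{i,j,k,l} \mathbb{E} A_{ij} A_{jk} A_{kl} A_{li}
  &\le n^2 O(1) p^{-2} + 2 n^3 O(1) p^{-2} + n^4 p^{-3} \\
  &= O_\gamma(1)\, n + O_\gamma(1),
\end{aligned}
\]

and when the degree $l \ge 2$,

\[
\begin{aligned}
\sum_{i,j,k,l} \mathbb{E} A_{ij} A_{jk} A_{kl} A_{li}
  &\le n^2 O_l(1) p^{-2} + 2 n^3 O_l(1) p^{-2} + n^4 O_l(1) p^{-4} \\
  &= O_{l,\gamma}(1)\, n + O_{l,\gamma}(1).
\end{aligned}
\]

Combining the above estimates with Eq. (4.6) leads to the desired bound. □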
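
The deterministic inequality behind Eq. (4.6), $s(A)^4 \le \mathrm{Tr}(A^4)$,
and the expansion of the trace as a sum over index quadruples can be
illustrated on a single realization. The sketch below uses the linear kernel
(degree 1, so that $A_{ij} = X_i^T X_j$ off the diagonal); the sizes, the seed
and the variable names are assumptions made here.

```python
import numpy as np

rng = np.random.default_rng(4)                 # arbitrary seed
p, n = 40, 80                                  # illustrative sizes
X = rng.standard_normal((p, n)) / np.sqrt(p)   # columns ~ N(0, p^{-1} I_p)
A = X.T @ X                                    # linear kernel, f(xi) = xi
np.fill_diagonal(A, 0.0)                       # A_ii = 0 as in Eq. (1.1)
A = (A + A.T) / 2.0                            # remove round-off asymmetry

eigs = np.linalg.eigvalsh(A)
spectral_norm = np.max(np.abs(eigs))           # s(A)
trace_A4 = np.sum(eigs ** 4)                   # Tr(A^4) = sum_i lambda_i^4

# For symmetric A, sum_{i,j,k,l} A_ij A_jk A_kl A_li = sum_{i,k} ((A^2)_ik)^2.
quadruple_sum = np.sum((A @ A) ** 2)
fourth_root_bound = trace_A4 ** 0.25           # upper bound on s(A)
```

The bound is loose by design: it controls the largest eigenvalue by the sum of
all fourth powers, which is what produces the $n^{1/4}$ factor in Lemma 4.2.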



4.3. Proof of Thru. 3-4 

Proof of Thm. 3.4- Same as in Sec. 2.2, it suffices to show the mean convergence 
of the Sticltjes transform. Specifically, we want to show that for a fixed z = u+iv, 
Em,A{z) converges to the unique solution of Eq. (3.5). Recall that the expansion 
Eq. (3.2) converges p-uniformly in H p , and we first reduce the general case to 
that where the expansion has finite many terms. 

Step 1. Reduction to the case of finite expansion up to order L. 

Denote the truncated kernel function up to finite order $L$ by $f_L(\xi;p) = p^{-1/2}k_L(\sqrt{p}\,\xi;p)$, where (recall that $a_{0,p} = 0$ by Remark 3.5)

$$k_L(x;p) = \sum_{l=1}^{L} a_{l,p}P_{l,p}(x).$$

Let $m_A(z)$ and $m_L(z)$ be the Stieltjes transforms of the random kernel matrix
with the kernel function $f(\xi;p)$ and $f_L(\xi;p)$, respectively. For a fixed $z$, define

$$\mathrm{RHS}(m; a, \nu) = \Big(-z - a\Big(1 - \Big(1 + \frac{a}{\gamma}m\Big)^{-1}\Big) - \frac{\nu - a^2}{\gamma}m\Big)^{-1}. \tag{4.7}$$

The goal is to show that, as $p, n \to \infty$ with $p/n = \gamma$, $\mathbb{E}m_A$ converges to the
solution of Eq. (3.5), which can be rewritten as $m = \mathrm{RHS}(m; a, \nu)$, and it suffices
to show that

$$|\mathbb{E}m_A - \mathrm{RHS}(\mathbb{E}m_A; a, \nu)| \to 0. \tag{4.8}$$
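As an illustration of the fixed-point form $m = \mathrm{RHS}(m; a, \nu)$, the sketch below iterates the map numerically. It assumes the form $\mathrm{RHS}(m;a,\nu) = (-z - a(1 - (1 + am/\gamma)^{-1}) - (\nu - a^2)m/\gamma)^{-1}$; the function names and the parameter values are hypothetical, and convergence of the naive iteration is assumed rather than proved.

```python
def rhs(m, z, a, nu, gamma):
    """RHS(m; a, nu) = (-z - a(1 - (1 + a m/gamma)^{-1}) - (nu - a^2) m/gamma)^{-1}."""
    inner = 1.0 - 1.0 / (1.0 + a * m / gamma)
    return 1.0 / (-z - a * inner - (nu - a ** 2) * m / gamma)

def solve_fixed_point(z, a, nu, gamma, iters=500):
    """Naive iteration m <- RHS(m), started from -1/z (convergence is an assumption)."""
    m = -1.0 / z
    for _ in range(iters):
        m = rhs(m, z, a, nu, gamma)
    return m

z = 1j
m = solve_fixed_point(z, a=0.5, nu=1.0, gamma=1.0)
residual = abs(m - rhs(m, z, 0.5, 1.0, 1.0))
print(m, residual)
```

A solution found this way has positive imaginary part when $\Im(z) > 0$, as expected of a Stieltjes transform.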

We need the following lemma, whose proof is left to Appendix D: 

Lemma 4.4 (stability of the Stieltjes transform to $L^2$ perturbation in the kernel
function). Suppose that $X_i$ ($i = 1, \cdots, n$) are i.i.d. random vectors, and the two
functions $f_A(\xi;p)$ and $f_B(\xi;p)$ satisfy, for large $p$,

$$\mathbb{E}(f_A(X^TY;p) - f_B(X^TY;p))^2 \le \epsilon p^{-1},$$

where $X$ and $Y$ are two independent random vectors distributed in the same way
as the $X_i$'s, and $\epsilon$ is some positive constant. Let $A$ be the n-by-n random kernel
matrix with the kernel function $f_A(\xi;p)$, and $B$ the one with $f_B(\xi;p)$. Also, let $m_A$ and
$m_B$ be the Stieltjes transforms of $A$ and $B$ respectively. Then for a fixed $z$,

$$\mathbb{E}|m_A(z) - m_B(z)| \le O(1)\sqrt{\epsilon}.$$

By condition (C.p-Uniform), for arbitrary $\epsilon > 0$, there exists some $L = L(\epsilon)$
so that $\mathbb{E}(k(\xi_p;p) - k_{L(\epsilon)}(\xi_p;p))^2 \le \epsilon^2$ for all $p$, and then

$$\mathbb{E}(f(X^TY;p) - f_{L(\epsilon)}(X^TY;p))^2 \le \epsilon^2p^{-1}.$$

By Lemma 4.4,

$$|\mathbb{E}m_A(z) - \mathbb{E}m_{L(\epsilon)}(z)| \le \mathbb{E}|m_A(z) - m_{L(\epsilon)}(z)| \le O(1)\epsilon.$$

If in addition we can show that, for any fixed $L$ and some sequences $a_L(p)$ and
$\nu_L(p)$,

$$|\mathbb{E}m_L - \mathrm{RHS}(\mathbb{E}m_L; a_L(p), \nu_L(p))| \to 0, \qquad a_L(p) \to a, \quad \nu_L(p) \to \nu, \tag{4.9}$$

then Eq. (4.8) holds asymptotically.

Step 2. Convergence of $\mathbb{E}m_L(z)$ for finite $L$.

With slight abuse of notation, we denote the random kernel matrix with
kernel function $f_L(\xi;p)$ by $A$. Its Stieltjes transform is denoted by $m_L(z)$. In
what follows we sometimes drop the dependence on $p$ and write $f_L(\xi;p)$ as $f_L(\xi)$,
and similarly for other functions.

Recall that

$$\mathbb{E}m_L(z) = \mathbb{E}((A - zI)^{-1})_{nn} = \mathbb{E}\big(-z - A_{\cdot,n}^T(A^{(n)} - zI_{n-1})^{-1}A_{\cdot,n}\big)^{-1}.$$

Notations as in Eq. (2.11, 2.12, 2.13), we have

$$A_{\cdot,n} = f_{(1)} + f_{(2)}, \qquad f_{(1)} := a_{1,p}|X_n|\eta, \qquad f_{(2)} := (f_{>1}(\xi_{1n}), \cdots, f_{>1}(\xi_{n-1,n}))^T, \tag{4.11}$$

where $\xi_{in} = |X_n|\eta_i$ for $1 \le i \le n-1$, $\eta := (\eta_1, \cdots, \eta_{n-1})^T$, and $f_{>1}(\xi) := f_L(\xi) - a_{1,p}\xi$. The off-diagonal entries of $A^{(n)}$ are

$$A^{(n)}_{ij} = f_L(X_i^TX_j) = f_L(\eta_i\eta_j + \tilde{\xi}_{ij}), \qquad 1 \le i,j \le n-1,\ i \ne j,$$

where $\tilde{\xi}_{ij} = \tilde{X}_i^T\tilde{X}_j$.

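The typical magnitude of $\eta_i$ and $\tilde{\xi}_{ij}$ is $p^{-1/2}$, and specifically, we have the
large probability set $\Omega_\delta$ defined as

$$\Omega_\delta = \{|\eta_i| \le \delta,\ |\tilde{\xi}_{ij}| \le \delta,\ ||X_n|^2 - 1| \le \sqrt{2}\delta,\ 1 \le i,j \le n-1,\ i \ne j\}, \tag{4.12}$$

where $\delta = M/\sqrt{p}$, $M = \sqrt{20\ln p}$. By Lemma D.2, $\Pr(\Omega_\delta^c) \le O(1)p^{-7}$. On $\Omega_\delta$,

$$f_L(\eta_i\eta_j + \tilde{\xi}_{ij}) = a_{1,p}\eta_i\eta_j + a_{1,p}\tilde{\xi}_{ij} + f_{>1}(\tilde{\xi}_{ij}) + f'_{>1}(\tilde{\xi}_{ij})\eta_i\eta_j + t_{ij},$$

where $t_{ij} = \frac{1}{2}f''_{>1}(\theta_{ij})(\eta_i\eta_j)^2$ for some $\theta_{ij}$ between $\tilde{\xi}_{ij}$ and $\tilde{\xi}_{ij} + \eta_i\eta_j$.

Recall that $f_{>1}(\xi) = \frac{1}{\sqrt{p}}\sum_{l=2}^{L}a_{l,p}P_{l,p}(\sqrt{p}\,\xi)$, and by Eq. (4.5),

$$f'_{>1}(\xi) = \sum_{l=2}^{L}a_{l,p}\big(\sqrt{l}\,P_{l-1,p}(\sqrt{p}\,\xi) + O_l(1)M^{l-1}p^{-1}\big), \tag{4.13}$$

and

$$f''_{>1}(\xi) = \sqrt{p}\sum_{l=2}^{L}a_{l,p}\big(\sqrt{l(l-1)}\,P_{l-2,p}(\sqrt{p}\,\xi) + O_l(1)M^{l-2}p^{-1}\big). \tag{4.14}$$

We define

$$\tilde{A}^{(n)}_{ij} = a_{1,p}\tilde{\xi}_{ij} + f_{>1}(\tilde{\xi}_{ij}), \qquad F_{ij} = \frac{1}{\sqrt{p}}\sum_{l=2}^{L}a_{l,p}\sqrt{l}\,P_{l-1,p}(\sqrt{p}\,\tilde{\xi}_{ij}), \qquad i \ne j,$$

and set the diagonal entries to be zeros for both $\tilde{A}^{(n)}$ and $F$, then

$$A^{(n)} = \tilde{A}^{(n)} + a_{1,p}(\eta\eta^T - D_\eta) + \sqrt{p}\,WFW + T,$$

where $T$ is Hermitian with $T_{ij} = t_{ij} + \eta_i\eta_j\big(f'_{>1}(\tilde{\xi}_{ij}) - \sqrt{p}\,F_{ij}\big)$, and $W = \mathrm{diag}\{\eta_1, \cdots, \eta_{n-1}\}$.

We have (recall that $\sum_{l=1}^{L}a_{l,p}^2$ is bounded by some $O(1)$ constant for all $p$, by
Remark 3.1)

1. Since $\theta_{ij}$ is between $\tilde{\xi}_{ij}$ and $\tilde{\xi}_{ij} + \eta_i\eta_j$, and both $\tilde{\xi}_{ij}$ and $\eta_i$ are bounded in magnitude by $\delta = p^{-1/2}M$, then $|\theta_{ij}| \le \delta + \delta^2 \le 1.01\delta = p^{-1/2}\cdot 1.01M$. Thus, by Eq. (4.14, 4.4), $|f''_{>1}(\theta_{ij})| \le \sqrt{p}\,O_L(1)M^{L-2}$, and then $|t_{ij}| \le O_L(1)M^{L+2}p^{-3/2}$. Together with Eq. (4.13),

$$|T_{ij}| \le O_L(1)M^{L+2}p^{-3/2} + O_L(1)M^{L-1}p^{-3/2} = O_L(1)M^{L+2}p^{-3/2}. \tag{4.15}$$

As a result,

$$s(T - a_{1,p}D_\eta)\cdot\mathbf{1}_{\Omega_\delta} \le s(T)\cdot\mathbf{1}_{\Omega_\delta} + |a_{1,p}|\delta = O_L(1)M^{L+2}p^{-1/2} + O(1)Mp^{-1/2} = O_L(1)M^{L+2}p^{-1/2}. \tag{4.16}$$

2. $F$ can be written as $\sum_{l=2}^{L}a_{l,p}\sqrt{l}\,F_l$, where Lemma 4.2 applies to each $F_l$, and the coefficients $a_{l,p}$ for $1 \le l \le L-1$ are uniformly bounded by some constant since $\sum_{l=1}^{\infty}a_{l,p}^2 = \nu_p \to \nu$ by Condition (C.Variance). Thus we have

$$\mathbb{E}s(F) \le \sum_{l=2}^{L}\sqrt{l}\,O_l(1)p^{1/4} = O_L(1)p^{1/4}, \tag{4.17}$$

and as a result,

$$\mathbb{E}s(\sqrt{p}\,WFW)\cdot\mathbf{1}_{\Omega_\delta} \le M^2p^{-1/2}\,\mathbb{E}s(F) \le O_L(1)M^2p^{-1/4}. \tag{4.18}$$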

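Now we break the quantity $A_{\cdot,n}^T(A^{(n)} - zI_{n-1})^{-1}A_{\cdot,n}$ into the following pieces:
define $\hat{A}^{(n)} = a_{1,p}\eta\eta^T + \tilde{A}^{(n)}$, and recall that $A_{\cdot,n} = f_{(1)} + f_{(2)}$ as defined in Eq.
(4.11),

$$A_{\cdot,n}^T(A^{(n)} - zI_{n-1})^{-1}A_{\cdot,n} = A_{\cdot,n}^T(\hat{A}^{(n)} - zI_{n-1})^{-1}A_{\cdot,n} - A_{\cdot,n}^T(A^{(n)} - zI_{n-1})^{-1}(\sqrt{p}\,WFW + T - a_{1,p}D_\eta)(\hat{A}^{(n)} - zI_{n-1})^{-1}A_{\cdot,n}$$

$$= f_{(1)}^T(\hat{A}^{(n)} - zI_{n-1})^{-1}f_{(1)} + f_{(2)}^T(\hat{A}^{(n)} - zI_{n-1})^{-1}f_{(2)} + r_2 - r_1, \tag{4.19}$$

where

$$r_2 = 2f_{(1)}^T(\hat{A}^{(n)} - zI_{n-1})^{-1}f_{(2)}, \qquad r_1 = A_{\cdot,n}^T(A^{(n)} - zI_{n-1})^{-1}(\sqrt{p}\,WFW + T - a_{1,p}D_\eta)(\hat{A}^{(n)} - zI_{n-1})^{-1}A_{\cdot,n}. \tag{4.20}$$

For $r_2$,

$$r_2 = 2a_{1,p}|X_n|\eta^T(\hat{A}^{(n)} - zI_{n-1})^{-1}f_{(2)} = 2a_{1,p}f_{(2)}^T(\hat{A}^{(n)} - zI_{n-1})^{-1}(|X_n|\eta)$$

$$= 2a_{1,p}\big\{f_{(2)}^T(\tilde{A}^{(n)} - zI_{n-1})^{-1}(|X_n|\eta) - f_{(2)}^T(\hat{A}^{(n)} - zI_{n-1})^{-1}a_{1,p}\eta\eta^T(\tilde{A}^{(n)} - zI_{n-1})^{-1}(|X_n|\eta)\big\} := 2a_{1,p}(r_{2,1} - r_{2,2}), \tag{4.21}$$

and by a moment method we can show that (Lemma D.3)

$$\mathbb{E}|r_2|\cdot\mathbf{1}_{\Omega_\delta} \le O_L(1)M^2p^{-1/2}. \tag{4.22}$$

To bound $r_1$, we restrict ourselves to $\Omega_\delta$, where $\|A_{\cdot,n}\|^2 = \sum_{i=1}^{n-1}f_L(\xi_{in})^2 = O_L(1)M^L$, and with Eq. (4.18, 4.16)

$$\mathbb{E}|r_1|\cdot\mathbf{1}_{\Omega_\delta} \le \mathbb{E}\big(s(\sqrt{p}\,WFW) + s(T - a_{1,p}D_\eta)\big)\|A_{\cdot,n}\|^2\cdot\mathbf{1}_{\Omega_\delta}$$

$$\le O_L(1)M^L\,\mathbb{E}\big(s(\sqrt{p}\,WFW) + s(T - a_{1,p}D_\eta)\big)\cdot\mathbf{1}_{\Omega_\delta}$$

$$= O_L(1)M^L\big(O_L(1)M^2p^{-1/4} + O_L(1)M^{L+2}p^{-1/2}\big) = O_L(1)M^{2L+2}p^{-1/4}. \tag{4.23}$$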
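Furthermore, as in Sec. 2.2, we can compute the first term in Eq. (4.19):

$$f_{(1)}^T(\hat{A}^{(n)} - zI_{n-1})^{-1}f_{(1)} = |X_n|^2a_{1,p}^2\,\eta^T(\hat{A}^{(n)} - zI_{n-1})^{-1}\eta$$

$$= |X_n|^2a_{1,p}\big(1 - (1 + a_{1,p}\eta^T(\tilde{A}^{(n)} - zI_{n-1})^{-1}\eta)^{-1}\big)$$

$$= |X_n|^2a_{1,p}\big(1 - (1 + a_{1,p}(\gamma^{-1}\mathbb{E}\tilde{m}(z) + \gamma^{-1}\tilde{r} + r_{(1),2}))^{-1}\big),$$

where $\tilde{m}(z) = \frac{1}{n-1}\mathrm{Tr}(\tilde{A}^{(n)} - zI_{n-1})^{-1}$, and

1. $\tilde{r} = \tilde{m}(z) - \mathbb{E}\tilde{m}(z)$, $\mathbb{E}|\tilde{r}| \le O(1)n^{-1/2}$ by Lemma 2.4;

2. The term

$$r_{(1),2} = \eta^T(\tilde{A}^{(n)} - zI_{n-1})^{-1}\eta - \frac{1}{p}\mathrm{Tr}(\tilde{A}^{(n)} - zI_{n-1})^{-1}$$

is similar to $r_2$ in Lemma B.1 and satisfies $\mathbb{E}|r_{(1),2}| \le O(1)p^{-1/2}$.

Going through a process similar to that in Lemma B.1 to bound the denominators, including

1. introducing a large probability set

$$\Omega_{(1)} := \{|\tilde{r}| \le p^{-1/4},\ |r_{(1),2}| \le p^{-1/4}\}, \qquad \Pr(\Omega_{(1)}^c) \le O(1)p^{-1/4},$$

so as to bound $|(1 + a_{1,p}\gamma^{-1}\mathbb{E}\tilde{m}(z))^{-1}|$ on $\Omega_\delta \cap \Omega_{(1)}$ by $O(1)M^2$,

2. making use of the fact that $|(1 + a_{1,p}\eta^T(\tilde{A}^{(n)} - zI_{n-1})^{-1}\eta)^{-1}|$ on $\Omega_\delta$ is bounded by $O(1)M^2$,

we have

$$f_{(1)}^T(\hat{A}^{(n)} - zI_{n-1})^{-1}f_{(1)} = a_{1,p}\Big(1 - \Big(1 + \frac{a_{1,p}}{\gamma}\mathbb{E}\tilde{m}(z)\Big)^{-1}\Big) + r_{(1)}, \tag{4.24}$$

where

$$\mathbb{E}|r_{(1)}|\cdot\mathbf{1}_{\Omega_\delta\cap\Omega_{(1)}} \le O(1)M^4p^{-1/2}. \tag{4.25}$$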
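We turn to compute the second term in Eq. (4.19). We have

$$f_{(2)}^T(\hat{A}^{(n)} - zI_{n-1})^{-1}f_{(2)} = f_{(2)}^T(\tilde{A}^{(n)} - zI_{n-1})^{-1}f_{(2)} - f_{(2)}^T(\hat{A}^{(n)} - zI_{n-1})^{-1}a_{1,p}\eta\eta^T(\tilde{A}^{(n)} - zI_{n-1})^{-1}f_{(2)}$$

$$= \frac{\nu_{>1,p}}{\gamma}\mathbb{E}\tilde{m}(z) + \frac{\nu_{>1,p}}{\gamma}\tilde{r} + r_{(2),2} - r_{(2),3}, \tag{4.26}$$

where

$$\frac{\nu_{>1,p}}{p} = \mathbb{E}(f_{(2)})_i^2 = \mathbb{E}f_{>1}(\xi_{in})^2 = \frac{\nu_p - a_{1,p}^2}{p},$$

and

$$r_{(2),2} = f_{(2)}^T(\tilde{A}^{(n)} - zI_{n-1})^{-1}f_{(2)} - \frac{\nu_{>1,p}}{p}\mathrm{Tr}(\tilde{A}^{(n)} - zI_{n-1})^{-1},$$

$$r_{(2),3} = f_{(2)}^T(\hat{A}^{(n)} - zI_{n-1})^{-1}a_{1,p}\eta\eta^T(\tilde{A}^{(n)} - zI_{n-1})^{-1}f_{(2)} = \frac{1}{|X_n|}\,a_{1,p}\big(\eta^T(\hat{A}^{(n)} - zI_{n-1})^{-1}f_{(2)}\big)\,r_{2,1}.$$

For $r_{(2),2}$, by a moment method argument similar to the first part in the proof
of Lemma D.3, we have

$$\mathbb{E}|r_{(2),2}| \le O_L(1)p^{-1/2}. \tag{4.27}$$

To bound $r_{(2),3}$ we restrict ourselves to $\Omega_\delta$, where

$$|f_{>1}(\xi_{in})| \le O_L(1)M^Lp^{-1/2}, \qquad |\eta_i| \le Mp^{-1/2}, \qquad 1 \le i \le n-1,$$

thus

$$|a_{1,p}\eta^T(\hat{A}^{(n)} - zI_{n-1})^{-1}f_{(2)}|\cdot\mathbf{1}_{\Omega_\delta} \le O(1)\,s((\hat{A}^{(n)} - zI_{n-1})^{-1})\,\|\eta\|\cdot\|f_{(2)}\| \le \frac{1}{v}\sqrt{O(1)M^2}\sqrt{O_L(1)M^{2L}} = O_L(1)M^{L+1},$$

and then

$$\mathbb{E}|r_{(2),3}|\cdot\mathbf{1}_{\Omega_\delta} = \mathbb{E}\,\frac{1}{|X_n|}|r_{2,1}|\,|a_{1,p}(\eta^T(\hat{A}^{(n)} - zI_{n-1})^{-1}f_{(2)})|\cdot\mathbf{1}_{\Omega_\delta} \le O_L(1)M^{L+1}\,\mathbb{E}|r_{2,1}| \le O_L(1)M^{L+1}p^{-1/2}. \tag{4.28}$$

Now putting Eq. (4.19, 4.23, 4.22, 4.24, 4.25, 4.26, 4.27, 4.28) together, we have

$$\left|\mathbb{E}\,\mathbf{1}_{\Omega_\delta\cap\Omega_{(1)}}\left(m_L(z) - \Big(-z - a_{1,p}\Big(1 - \Big(1 + \frac{a_{1,p}}{\gamma}\mathbb{E}\tilde{m}(z)\Big)^{-1}\Big) - \frac{\nu_p - a_{1,p}^2}{\gamma}\mathbb{E}\tilde{m}(z)\Big)^{-1}\right)\right|$$

$$\le \frac{1}{v^2}\mathbb{E}\big(|r_1| + |r_2| + |r_{(1)}| + |\nu_{>1,p}\gamma^{-1}\tilde{r}| + |r_{(2),2}| + |r_{(2),3}|\big)\cdot\mathbf{1}_{\Omega_\delta\cap\Omega_{(1)}}$$

$$\le O_L(1)M^{2L+2}p^{-1/4} + O_L(1)M^2p^{-1/2} + O(1)M^4p^{-1/2} + O(1)n^{-1/2} + O_L(1)p^{-1/2} + O_L(1)M^{L+1}p^{-1/2}$$

$$= O_L(1)M^{2L+2}p^{-1/4} \to 0. \tag{4.29}$$

Meanwhile, similar to the proof of Lemma B.6 (making use of the fact that
$\mathbb{E}s(\sqrt{p}\,WFW + T)\cdot\mathbf{1}_{\Omega_\delta} \le O_L(1)M^2p^{-1/4}$ and the inequality that $\mathrm{Tr}(AB) \le n\cdot s(A)s(B)$ for n-by-n Hermitian matrices $A$ and $B$), it can be shown that

$$\mathbb{E}|m_L(z) - \tilde{m}(z)| \to 0.$$

With Eq. (4.29), we have (dropping the dependence on $z$)

$$|\mathbb{E}\tilde{m} - \mathrm{RHS}(\mathbb{E}\tilde{m}; a_{1,p}, \nu_p)| \to 0,$$

and thus

$$|\mathbb{E}m_L - \mathrm{RHS}(\mathbb{E}m_L; a_{1,p}, \nu_p)| \to 0.$$

At last, by conditions (C.Variance) and (C.$a_1$), $a_{1,p} \to a$ and $\nu_p \to \nu$. Thus
Eq. (4.9) is verified if we set $a_L(p) = a_{1,p}$ and $\nu_L(p) = \nu_p$. □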

4.4. Model $X_i \sim U(S^{p-1})$

We also consider the model where the random vectors $X_i$'s are i.i.d. uniformly
distributed on a high-dimensional sphere. For this model, the marginal distribution
of the inner product $\xi_{ij} = X_i^TX_j$ has probability density $Q_p(u) = \Lambda_p(1 - u^2)^{(p-3)/2}$,
where $\Lambda_p$ is a normalization constant. Let $\xi_p$ have the same
distribution as $\sqrt{p}\,\xi_{ij}$, whose probability density is $q_p(x) = \frac{1}{\sqrt{p}}Q_p(\frac{x}{\sqrt{p}})$, and let
$\mathcal{H}_p = L^2(\mathbb{R}, q_p(x)dx)$. By Lemma D.4,

$$\int_{\mathbb{R}}x^kq_p(x)dx = \begin{cases}(k-1)!! + O_k(1)p^{-1}, & k \text{ even};\\ 0, & k \text{ odd},\end{cases}$$

which echoes Eq. (4.3). As a result, by Lemma 4.1, the orthonormal polynomials
of $\mathcal{H}_p$ are asymptotically consistent with the Hermite polynomials. If we
expand $k(x;p)$ into the orthonormal polynomials of $\mathcal{H}_p$, and require the conditions
(C.Variance), (C.p-Uniform) and (C.$a_1$) accordingly, the result in
Thm. 3.4 still holds.
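The moment asymptotics above can be illustrated with exact values. The sketch below relies on a standard fact not stated in this paper: for $X, Y$ uniform on $S^{p-1}$, $\mathbb{E}(X^TY)^{2m} = \Gamma(m + \tfrac12)\Gamma(\tfrac p2)/(\Gamma(\tfrac12)\Gamma(m + \tfrac p2))$. The function names are hypothetical.

```python
import math

def sphere_moment(k, p):
    """k-th moment of sqrt(p) * X^T Y for X, Y uniform on the unit sphere in R^p."""
    if k % 2 == 1:
        return 0.0                      # odd moments vanish by symmetry
    m = k // 2
    log_val = (math.lgamma(m + 0.5) + math.lgamma(p / 2.0)
               - math.lgamma(0.5) - math.lgamma(m + p / 2.0))
    return p ** m * math.exp(log_val)

def gaussian_moment(k):
    """(k-1)!! for even k, the k-th moment of N(0, 1)."""
    out = 1
    for j in range(k - 1, 0, -2):
        out *= j
    return out

for p in (10, 100, 1000):
    print(p, [round(sphere_moment(k, p), 4) for k in (2, 4, 6)],
          [gaussian_moment(k) for k in (2, 4, 6)])
```

For instance, the fourth moment equals $3p/(p+2) = 3 - 6/(p+2)$, which differs from the Gaussian value $3$ by a term of order $p^{-1}$.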

One way of showing this is sketched as follows:

Condition on the draw of $X_n$, and without loss of generality let $X_n = (1, 0, \cdots, 0)^T$. Then

$$X_i = (u_i, \sqrt{1 - u_i^2}\,\tilde{X}_i^T)^T, \qquad 1 \le i \le n-1,$$

where the $u_i$'s are i.i.d., and the $\tilde{X}_i$'s are i.i.d. uniformly distributed on the
unit sphere in $\mathbb{R}^{p-1}$, independently from the $u_i$'s. As a result, let $\xi_{ij} = X_i^TX_j$ and
$\tilde{\xi}_{ij} = \tilde{X}_i^T\tilde{X}_j$, then

$$\xi_{ij} = u_iu_j + \sqrt{1 - u_i^2}\sqrt{1 - u_j^2}\,\tilde{\xi}_{ij}, \qquad 1 \le i,j \le n-1,\ i \ne j,$$

which is different from before. However, on the large probability set

$$\Omega_\delta = \{|u_i| \le \delta,\ |\tilde{\xi}_{ij}| \le \delta,\ 1 \le i,j \le n-1,\ i \ne j\}, \qquad \delta = p^{-1/2}M,\ M = \sqrt{20\ln p},$$

it can be shown that

$$\xi_{ij} = u_iu_j + \tilde{\xi}_{ij} + r_{ij}, \qquad |r_{ij}| \le \delta^3.$$

Thus, the Taylor expansion can be carried out in the same way, where the
contribution of the extra term is put into $T_{ij}$ and the bound Eq. (4.15)
remains true.

We still need the mean spectral norm bound to show that Eq. (4.17) holds,
and to use the bound given by the 4th moment (Lemma 4.2), it suffices to
establish the bound in Lemma D.1. Notice that Gegenbauer polynomials [1]
are orthogonal in the space $L^2([-1,1], Q_p(u)du)$. Gegenbauer polynomials are
related to the p-spherical harmonics $\{\phi_j, j \in J\}$, which form an orthonormal
basis of $L^2(S^{p-1}, dP)$. $J = \cup_{l=0}^{\infty}J_l$, and $\{\phi_j(X), j \in J_l\}$ are p-spherical harmonics
of degree $l$, which are homogeneous harmonic polynomials restricted to the
surface of the unit sphere. The Gegenbauer polynomial of degree $l$ as a function
of $X^TY$, $X, Y \in S^{p-1}$, up to a multiplicative constant, equals

$$\sum_{j\in J_l}\phi_j(X)\phi_j(Y),$$

which is named "the $l$-degree zonal harmonic function with axis $X$". We thus
define $G_{l,p}(\xi)$ to be

$$G_{l,p}(X^TY) = \sum_{j\in J_l}\phi_j(X)\phi_j(Y). \tag{4.30}$$

Notice that $G_{1,p}(X^TY) = pX^TY$, and by convention $G_{0,p} = 1$. $G_{l,p}(\xi)$ is a
polynomial of degree $l$ for all $l$, and

$$\int\!\!\int G_{l,p}(X^TY)G_{k,p}(X^TY)\,dP(X)dP(Y) = \delta_{l,k}|J_l|.$$

$|J_l|$ is the number of p-spherical harmonics of degree $l$, $|J_1| = p$, and for $l \ge 2$

$$|J_l| = \binom{p+l-1}{l} - \binom{p+l-3}{l-2}.$$

Thus, the orthonormal polynomials $P_{l,p}(x)$ of the space $\mathcal{H}_p$ can be written as

$$P_{l,p}(x) = \frac{1}{\sqrt{|J_l|}}\,G_{l,p}\Big(\frac{x}{\sqrt{p}}\Big).$$

By Eq. (4.30), we have

$$\int_{S^{p-1}}G_{l,p}(X_1^TX_2)G_{l,p}(X_2^TX_3)\,dP(X_2) = G_{l,p}(X_1^TX_3), \qquad X_1, X_2, X_3 \in S^{p-1},$$

which gives that (define $\xi_{ij} = X_i^TX_j$)

$$\mathbb{E}\big[P_{l,p}(\sqrt{p}\,\xi_{12})P_{l,p}(\sqrt{p}\,\xi_{23})\,\big|\,X_1, X_3\big] = \frac{1}{\sqrt{|J_l|}}\,P_{l,p}(\sqrt{p}\,\xi_{13}).$$

As a result, $\mathbb{E}P_{l,p}(\sqrt{p}\,\xi_{12})P_{l,p}(\sqrt{p}\,\xi_{23})P_{l,p}(\sqrt{p}\,\xi_{34})P_{l,p}(\sqrt{p}\,\xi_{41})$ is bounded by $\frac{1}{|J_l|}$,
which is stronger than the estimate in Lemma D.1. We comment that carrying
out this analysis to higher order moments gives a moment-method proof of the
convergence to the semi-circle law of the ESD of random kernel matrices where
$k(x;p) = P_{l,p}(x)$ for $l \ge 2$, under the model $X_i \sim U(S^{p-1})$.

To continue to show the result in Thm. 3.4, the mechanism in Sec. 4.3 applies
to what follows in almost the same way.

Another way of extending to the model where $X_i \sim U(S^{p-1})$ is by comparing
to the standard Gaussian case. That is, to replace the $X_i$ by $X_i/|X_i|$ in the model
$X_i \sim \mathcal{N}(0, p^{-1}I_p)$ and to bound the resulting difference in $m_A(z)$ (reducing to
the finite expansion case $k = k_L$ first). This "comparison" argument can be used
to extend the result in Thm. 3.4 to other models of the distribution of the $X_i$'s, but
we do not develop this idea any further here.

5. Summary and Discussion 

The main theorem, Thm. 3.4, establishes the convergence of the spectral density
of random kernel matrices in the limit $p, n \to \infty$, $p/n = \gamma$, under the assumption
that the random vectors are standard Gaussian. The theorem and the proofs
also hold under the condition that $p/n \to \gamma$. Our proof is based on analyzing
the Stieltjes transform of the random kernel matrix, and uses the expansion
of the kernel function into orthonormal Hermite-like polynomials. The limiting
spectral density holds for a larger class of kernel functions than the cases studied
in [9], which are smooth kernels.

The assumption that the random vectors are standard Gaussian can be weakened.
We showed that the result extends to the case that they are uniformly
distributed over the unit sphere. Numerical simulations (not reported here) indicate
that the limiting spectral density holds for other non-Gaussian random
vectors. This includes the case where the $X_i$'s are uniformly sampled from the $2^p$
vertices of the hypercube $\{-p^{-1/2}, p^{-1/2}\}^p$ (where the value of the sign kernel
and the divergent kernel at $x = 0$ is set to 0). The universality of the limiting
spectral density is however beyond the scope of this paper.

While our paper mainly focused on the limiting spectral density, another
question of practical importance concerns the statistics of the largest eigenvalue
of random kernel matrices. This includes studying the mean, variance, limiting
distribution, as well as large deviation bounds for the largest eigenvalue. As
discussed in Remark 4.3, the bound in Lemma 4.2 for the expected value of
the spectral norm is far from being sharp. Numerical simulations (not reported
here) have shown that for the models studied in this paper, the largest/smallest
eigenvalue lies at the right/left end of the support of the limiting spectral density,
and thus both of them are conjectured to be $O(1)$ almost surely. We are
not aware of any result concerning the limiting probability law of the largest
eigenvalue of random kernel matrices, except for the one in [9] where the kernel
function is assumed to have strong ($C^3$) regularity. Many other interesting questions
can be asked from the RMT point of view, e.g. the "eigenvalue spacing"
problem, namely the "local law" of eigenvalues. If the asymptotic concentration
of the eigenvalues at the "local level" could be established, one consequence
would be that the top eigenvalue can be shown to concentrate at the right end
of the limiting spectral density.

There are several interesting extensions of the inner-product kernel matrix
model. The first possible extension is to distance kernel functions of the form
$f(X_i, X_j) = f(|X_i - X_j|)$, which are popular in machine learning applications.
Due to the relation

$$|X_i - X_j|^2 = |X_i|^2 + |X_j|^2 - 2X_i^TX_j,$$

for the model where $X_i \sim U(S^{p-1})$, where $|X_i| = 1$, distance kernels can
be regarded as inner-product kernels. However, for the model where $X_i \sim \mathcal{N}(0, p^{-1}I_p)$,
the fluctuations in the $|X_i|$'s do seem to make a difference, and so
far we have not been able to draw any conclusion about the limiting spectrum.
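The reduction of a distance kernel to an inner-product kernel on the sphere can be checked directly. The sketch below uses a hypothetical Gaussian-type distance kernel chosen only for illustration; on the unit sphere $|X_i - X_j|^2 = 2 - 2X_i^TX_j$, so the same kernel can be written as a function of the inner product.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5, 50
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=1, keepdims=True)     # rows lie on the unit sphere

gram = X @ X.T                                     # inner products X_i^T X_j
sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)

def distance_kernel(d):
    """Hypothetical distance kernel f(|X_i - X_j|), chosen only for illustration."""
    return np.exp(-d ** 2)

def induced_inner_product_kernel(xi):
    """The same kernel as a function of the inner product xi, valid on the sphere."""
    return np.exp(-(2.0 - 2.0 * xi))

K_dist = distance_kernel(np.sqrt(np.maximum(sq_dist, 0.0)))
K_inner = induced_inner_product_kernel(gram)
print(np.max(np.abs(K_dist - K_inner)))
```

For Gaussian vectors the terms $|X_i|^2$ and $|X_j|^2$ fluctuate around 1, and this entrywise identity no longer holds.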

Another extension is to kernels that are of more general forms, neither an
inner-product kernel nor a distance one. For example, a complex-valued kernel
has been used in [15] for a dataset of tomographic images. Every pair of images is
brought into in-plane rotational alignment. The modulus of the kernel function
corresponds to the similarity of the images when they are optimally aligned,
while the phase of the kernel is the optimal in-plane alignment angle. Notice
that this kernel is discontinuous, since a small perturbation in the images may
lead to a completely different phase. Similar kernels with discontinuity have also
been used for dimensionality reduction [14] and sensor network localization [7].
In many senses, these applications have been the motivation for the analysis
presented in this paper.

Finally, it is also possible to extend the study to non-Hermitian matrices
as follows. Suppose that $X_1, \cdots, X_m$ are $m$ i.i.d. random vectors in $\mathbb{R}^p$, and
$Y_1, \cdots, Y_n$ are $n$ i.i.d. random vectors in $\mathbb{R}^p$, independent from the $X_i$'s. The
m-by-n matrix $A$ is constructed as $A_{ij} = f(X_i^TY_j)$ where $f$ is some function.
The distribution of the singular values of $A$ in the limit $p, m, n \to \infty$ and
$p/n = \gamma_1$, $p/m = \gamma_2$ is conjectured to converge to a certain limiting density.
Appendix A: Solution of the Equation of m(z)

We rewrite Eq. (3.5) as

$$\frac{a(\nu - a^2)}{\gamma}m^3 + (\nu + az)m^2 + (a + \gamma z)m + \gamma = 0, \qquad \nu > 0,\ \gamma > 0, \tag{A.1}$$

where $a^2 \le \nu$. When $a = 0$ ($a^2 = \nu$) the equation corresponds to the semi-circle
distribution (M.P. distribution), and the existence and uniqueness of the
solution with positive imaginary part are known. We consider the case where
$0 < a^2 < \nu$, thus the cubic term in Eq. (A.1) does not vanish.

Lemma A.1. For every $z$ with $\Im(z) > 0$, there exists a unique $m$ with $\Im(m) > 0$
for which Eq. (A.1) holds.

Proof. It can be verified that whenever $a, \nu, \gamma$ are real and $\Im(z) > 0$, the solution
$m$ must not be real. Define the domain $\mathcal{D} := \{(a, \nu, \gamma, z),\ \gamma > 0,\ 0 < a^2 < \nu,\ \Im(z) > 0\}$,
which has two connected components $\mathcal{D}_+ = \mathcal{D} \cap \{a > 0\}$ and $\mathcal{D}_- = \mathcal{D} \cap \{a < 0\}$.
The three solutions of the cubic equation depend continuously on
the coefficients, thus if we let $(a, \nu, \gamma, z)$ vary continuously in $\mathcal{D}_+$, the imaginary
parts of the three solutions never change sign, and similarly for $\mathcal{D}_-$. As a result,
it suffices to show that for one choice of $(a, \nu, \gamma, z) \in \mathcal{D}_+$ and one choice in $\mathcal{D}_-$
there is a unique solution with positive imaginary part. This can be done, for
example, by choosing $a = \pm 1/2$, $\nu = 1$, $\gamma = 1$ and $z = i$. □
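The last step of the proof can be checked numerically at the suggested parameter values. The sketch below assumes the cubic is $\frac{a(\nu - a^2)}{\gamma}m^3 + (\nu + az)m^2 + (a + \gamma z)m + \gamma = 0$ and uses `numpy.roots`; the helper names are hypothetical.

```python
import numpy as np

def cubic_roots(z, a, nu, gamma):
    """Roots m of (a(nu - a^2)/gamma) m^3 + (nu + a z) m^2 + (a + gamma z) m + gamma = 0."""
    coeffs = [a * (nu - a ** 2) / gamma, nu + a * z, a + gamma * z, gamma]
    return np.roots(coeffs)

def count_upper_half_plane(roots, tol=1e-12):
    """Number of roots with strictly positive imaginary part."""
    return int(np.sum(roots.imag > tol))

for a in (0.5, -0.5):
    r = cubic_roots(1j, a, nu=1.0, gamma=1.0)
    print(a, np.round(r, 4), count_upper_half_plane(r))
```

In both cases exactly one of the three roots lies in the upper half-plane, which is the single verification the continuity argument requires for each connected component.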

The explicit expression for $y(u)$ defined in Eq. (3.6) is given by

$$y(u; a, \nu, \gamma) = \begin{cases}\dfrac{\sqrt{3}}{2\pi}\Big((\sqrt{D} + R)^{1/3} + (\sqrt{D} - R)^{1/3}\Big), & D > 0;\\[4pt] 0, & D \le 0,\end{cases} \tag{A.2}$$

where

$$D = Q^3 + R^2, \qquad R = (9\alpha_2\alpha_1 - 27\alpha_0 - 2\alpha_2^3)/54, \qquad Q = (3\alpha_1 - \alpha_2^2)/9,$$

and

$$m^3 + \alpha_2m^2 + \alpha_1m + \alpha_0 = 0$$

is derived from Eq. (A.1) (with $z = u$) by multiplying $\big(\frac{a(\nu - a^2)}{\gamma}\big)^{-1}$ on both sides. Explicitly,

$$\alpha_2 = \frac{(\nu + au)\gamma}{a(\nu - a^2)}, \qquad \alpha_1 = \frac{(a + \gamma u)\gamma}{a(\nu - a^2)}, \qquad \alpha_0 = \frac{\gamma^2}{a(\nu - a^2)}.$$

So all of $\alpha_2, \alpha_1, \alpha_0$, and thus $R$, $Q$ and $D$ are real numbers. $D$ is the "discriminant"
of the cubic equation, where $D$ turning from negative to positive signals the
emergence of a pair of complex solutions. The function $y(u; a, \nu, \gamma)$ is plotted in
Fig. 3 where $\nu = 1$, $a = \sqrt{2/\pi}$ and $\gamma = 0.1, 0.2, 0.3$. Notice the invariance of Eq.
(A.1) under the transformation

$$\nu c^2 \to \nu, \qquad ac \to a, \qquad zc \to z, \qquad m/c \to m,$$

where $c$ is any positive constant, which corresponds to multiplying the kernel
function by $c$.

Fig 3. Function $y(u; a, \nu, \gamma)$ as in Eq. (A.2).
Appendix B: Lemma in Sec. 2

Proof of Lemma 2.4. We need Burkholder's inequality (Lemma 2.12 of [4]),
which says that for $\{\gamma_k, 1 \le k \le n\}$ being a (complex-valued) martingale difference
sequence, for $\beta > 1$,

$$\mathbb{E}\Big|\sum_{k=1}^{n}\gamma_k\Big|^{\beta} \le K_\beta\,\mathbb{E}\Big(\sum_{k=1}^{n}|\gamma_k|^2\Big)^{\beta/2}, \tag{B.1}$$

where $K_\beta$ is a positive constant depending on $\beta$. Using the i.i.d. random vectors
$\{X_i, 1 \le i \le n\}$, we define the martingale to be

$$M_k = \mathbb{E}\big(\mathrm{Tr}(A - zI)^{-1}\,\big|\,\sigma\{X_{k+1}, \cdots, X_n\}\big) := \mathbb{E}_k\mathrm{Tr}(A - zI)^{-1}, \qquad 0 \le k \le n,$$

where $\sigma\{X_{k+1}, \cdots, X_n\} := \mathcal{F}_{n-k}$ denotes the $\sigma$-algebra generated by $\{X_i, k+1 \le i \le n\}$
and $\mathbb{E}(\cdot|\mathcal{G})$ the conditional expectation with respect to the sub-$\sigma$-algebra $\mathcal{G}$.
We have $M_n = \mathbb{E}\mathrm{Tr}(A - zI)^{-1}$ and $M_0 = \mathrm{Tr}(A - zI)^{-1}$, and
$M_n, \cdots, M_0$ form a martingale with respect to the filtration $\{\mathcal{F}_t, t = 0, \cdots, n\}$.
The martingale difference

$$\gamma_k = M_{k-1} - M_k = \mathbb{E}_{k-1}\mathrm{Tr}(A - zI)^{-1} - \mathbb{E}_k\mathrm{Tr}(A - zI)^{-1}$$

$$= \mathbb{E}_{k-1}\big(\mathrm{Tr}(A - zI)^{-1} - \mathrm{Tr}(A^{(k)} - zI)^{-1}\big) - \mathbb{E}_k\big(\mathrm{Tr}(A - zI)^{-1} - \mathrm{Tr}(A^{(k)} - zI)^{-1}\big), \tag{B.2}$$

where $A^{(k)}$ is an $(n-1)$-by-$(n-1)$ matrix that is obtained from the matrix $A$
by eliminating its $k$-th column and $k$-th row. Notice that $A^{(k)}$ is independent of
$X_k$, so $\mathbb{E}_{k-1}\mathrm{Tr}(A^{(k)} - zI)^{-1} = \mathbb{E}_k\mathrm{Tr}(A^{(k)} - zI)^{-1}$, which verifies the last line of
Eq. (B.2). At the same time, we have

$$|\mathrm{Tr}(A - zI)^{-1} - \mathrm{Tr}(A^{(k)} - zI)^{-1}| \le \frac{4}{v}, \tag{B.3}$$

where $v = \Im(z) > 0$, using an argument similar to that in Sec. 2.4 of [18] (see
Eq. (2.96)). The way to show Eq. (B.3) is by making use of (1) that the ordered
$n-1$ eigenvalues of a minor of a symmetric (or Hermitian) matrix $A$ 'interlace'
the ordered $n$ eigenvalues of $A$, which follows from the Courant-Fischer theorem
(see, for example, Exercise 1.3.14 of [18]), and (2) that for fixed $z$ both real and
imaginary parts of $(t - z)^{-1}$ as functions of $t$ have bounded total variation. As
a result,

$$|\gamma_k| \le |\mathbb{E}_{k-1}(\mathrm{Tr}(A - zI)^{-1} - \mathrm{Tr}(A^{(k)} - zI)^{-1})| + |\mathbb{E}_k(\mathrm{Tr}(A - zI)^{-1} - \mathrm{Tr}(A^{(k)} - zI)^{-1})| \le 2\cdot\frac{4}{v} := C,$$

and then with Eq. (B.1), choosing $\beta = 4$,

$$\mathbb{E}|m_A - \mathbb{E}m_A|^4 = \frac{1}{n^4}\mathbb{E}\Big|\sum_{k=1}^{n}\gamma_k\Big|^4 \le \frac{K_4}{n^4}\mathbb{E}\Big(\sum_{k=1}^{n}|\gamma_k|^2\Big)^2 \le \frac{K_4}{n^4}(nC^2)^2 = O(1)n^{-2}.$$

This implies the almost sure convergence of $m_A - \mathbb{E}m_A$ to 0 by the Borel-Cantelli
lemma. Also, Eq. (2.8) follows by Jensen's inequality. □
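The deterministic bound (B.3) on the change of the resolvent trace when one row and column are removed can be illustrated numerically. The sketch below is an assumption-laden check rather than part of the proof: it uses a generic symmetric Gaussian test matrix instead of a kernel matrix, and takes the constant on the right-hand side to be $4/v$, which is what the interlacing and total-variation argument yields. The helper name `resolvent_trace` is hypothetical.

```python
import numpy as np

def resolvent_trace(A, z):
    """Tr (A - z I)^{-1}, computed from the eigenvalues of a symmetric matrix A."""
    eig = np.linalg.eigvalsh(A)
    return np.sum(1.0 / (eig - z))

rng = np.random.default_rng(2)
n = 60
B = rng.standard_normal((n, n))
A = (B + B.T) / np.sqrt(2 * n)          # generic symmetric test matrix (assumption)
z = 0.3 + 0.5j
v = z.imag

gaps = []
for k in range(n):
    keep = np.delete(np.arange(n), k)
    A_minor = A[np.ix_(keep, keep)]     # remove the k-th row and column
    gaps.append(abs(resolvent_trace(A, z) - resolvent_trace(A_minor, z)))

print(max(gaps), 4.0 / v)
```

The bound does not depend on $n$ or on the entries of $A$, which is what makes the martingale differences $\gamma_k$ uniformly bounded.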

Lemma B.1. Notations as in Sec. 2.2,

$$\left|\mathbb{E}m_A(z) - \Big(-z - \Big(1 - \Big(1 + \frac{1}{\gamma}\mathbb{E}\tilde{m}(z)\Big)^{-1}\Big)\Big)^{-1}\right| \to 0.$$

Remark B.2. The proof provided below can be replaced by a simpler one. The 
reason we give this proof is that it contains many of the techniques that are 
used in showing the main result. 

Proof. Continue from Eq. (2.15). We first observe that when p is large, \X n \ 2 
concentrates at 1 , and specifically, with p large enough 



Pr 



II > 



140 Inp 
V 



<p- 



(B.4) 



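which can be verified by standard large deviation inequality techniques. However,
at this stage the following moment bound will be enough for our purpose:

$$\mathbb{E}||X_n|^2 - 1| \le \sqrt{\mathbb{E}(|X_n|^2 - 1)^2} = \sqrt{\frac{2}{p}} \to 0. \tag{B.5}$$

We then write the denominator in Eq. (2.15) as

$$\eta^T(\tilde{A}^{(n)} - D_\eta - zI_{n-1})^{-1}\eta = \frac{1}{p}\mathrm{Tr}(\tilde{A}^{(n)} - zI_{n-1})^{-1} + r = \frac{1}{\gamma}\mathbb{E}\tilde{m}(z) + \frac{1}{\gamma}\tilde{r} + r, \tag{B.6}$$

where $r = \eta^T(\tilde{A}^{(n)} - D_\eta - zI_{n-1})^{-1}\eta - \frac{1}{p}\mathrm{Tr}(\tilde{A}^{(n)} - zI_{n-1})^{-1}$, $\tilde{m}(z) := \frac{1}{n-1}\mathrm{Tr}(\tilde{A}^{(n)} - zI_{n-1})^{-1}$, and $\tilde{r} := \tilde{m}(z) - \mathbb{E}\tilde{m}(z)$. We have that

1. $\mathbb{E}|\tilde{r}| \le O(1)n^{-1/2}$ as $n \to \infty$: because $\tilde{A}^{(n)}$ is itself an $(n-1) \times (n-1)$
kernel matrix by Eq. (2.14), Lemma 2.4 applies.

2. $r$ splits into two terms

$$r = \big(\eta^T(\tilde{A}^{(n)} - D_\eta - zI_{n-1})^{-1}\eta - \eta^T(\tilde{A}^{(n)} - zI_{n-1})^{-1}\eta\big) + \big(\eta^T(\tilde{A}^{(n)} - zI_{n-1})^{-1}\eta - \frac{1}{p}\mathrm{Tr}(\tilde{A}^{(n)} - zI_{n-1})^{-1}\big) := r_1 + r_2,$$

where (1) $\mathbb{E}|r_2| \le O(1)p^{-1/2}$, by Lemma B.4; (2) $|r_1|\mathbf{1}_{\Omega_\delta} \le O(1)p^{-1/2}$,
where $\Omega_\delta$ is a large probability set depending on $p$, defined as

$$\Omega_\delta = \{|\eta_i| \le \delta,\ 1 \le i \le n-1,\ \delta = \frac{M}{\sqrt{p}}\}, \qquad M = \sqrt{20\ln p},$$

by Lemma B.3. Notice that $M = o(p^{\epsilon})$ for any $\epsilon > 0$.

Back to Eq. (2.10). By Eqs. (2.15) and (B.6), we have

$$\mathbb{E}m_A(z) = \mathbb{E}((A - zI)^{-1})_{nn} = \mathbb{E}\Big(-z - |X_n|^2\Big(1 - \Big(1 + \frac{1}{p}\mathrm{Tr}(\tilde{A}^{(n)} - zI_{n-1})^{-1} + r\Big)^{-1}\Big)\Big)^{-1}.$$

The following bounds (1) - (4) can be verified:

(1) (Lemma B.8) On $\Omega_\delta$, $|\eta^T(\tilde{A}^{(n)} - zI_{n-1})^{-1}\eta|$ and $|(1 + \eta^T(\tilde{A}^{(n)} - D_\eta - zI_{n-1})^{-1}\eta)^{-1}|$ are both bounded by $M' = 1 + O(1)M^2$, $M' = o(p^{\epsilon})$ for
any $\epsilon > 0$.

(2) (Lemma B.7) On $\Omega_r \cap \Omega_\delta$, $\big|\big(1 + \frac{1}{\gamma}\mathbb{E}\tilde{m}(z)\big)^{-1}\big| \le 2M'$, where we define

$$\Omega_r = \{|\tilde{r}| \le p^{-1/4},\ |r_2| \le p^{-1/4}\},$$

and by Markov inequality, we have

$$\Pr(\Omega_r^c) \le p^{1/4}\mathbb{E}|\tilde{r}| + p^{1/4}\mathbb{E}|r_2| \le O(1)p^{-1/4}$$

when $p$ is large.

(3) $|((A - zI)^{-1})_{nn}| \le \frac{1}{v}$, which is Eq. (2.4).

(4) $\big|\big(-z - \big(1 - \big(1 + \frac{1}{\gamma}\mathbb{E}\tilde{m}(z)\big)^{-1}\big)\big)^{-1}\big| \le \frac{1}{v}$: since $\Im\big(1 - \big(1 + \frac{1}{\gamma}\mathbb{E}\tilde{m}(z)\big)^{-1}\big)$
equals a positive number times $\Im\big(\frac{1}{\gamma}\mathbb{E}\tilde{m}(z)\big)$, which is also positive, one
verifies that

$$\Im\Big(-z - \Big(1 - \Big(1 + \frac{1}{\gamma}\mathbb{E}\tilde{m}(z)\Big)^{-1}\Big)\Big) \le \Im(-z) = -v,$$

so

$$\Big|-z - \Big(1 - \Big(1 + \frac{1}{\gamma}\mathbb{E}\tilde{m}(z)\Big)^{-1}\Big)\Big| \ge v.$$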
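With (1) and (2), we have

$$\mathbb{E}\Big|\big(1 + \eta^T(\tilde{A}^{(n)} - D_\eta - zI_{n-1})^{-1}\eta\big)^{-1} - \Big(1 + \frac{1}{\gamma}\mathbb{E}\tilde{m}(z)\Big)^{-1}\Big|\cdot\mathbf{1}_{\Omega_\delta\cap\Omega_r} \le \mathbb{E}(M'\cdot 2M')\Big(|r| + \frac{1}{\gamma}|\tilde{r}|\Big)\cdot\mathbf{1}_{\Omega_\delta\cap\Omega_r}$$

$$\le 2M'^2\big(\mathbb{E}|r_2| + \mathbb{E}|r_1|\cdot\mathbf{1}_{\Omega_\delta} + \gamma^{-1}\mathbb{E}|\tilde{r}|\big) \le 2M'^2\big(O(1)p^{-1/2} + O(1)p^{-1/2} + O(1)n^{-1/2}\big) = O(1)M'^2p^{-1/2}. \tag{B.7}$$

Using bounds (1)-(4), together with Eqs. (B.7) and (B.5), we have

$$\left|\mathbb{E}m_A(z) - \Big(-z - \Big(1 - \Big(1 + \frac{1}{\gamma}\mathbb{E}\tilde{m}(z)\Big)^{-1}\Big)\Big)^{-1}\right|$$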



(-z-\X n \ 2 r?{A^ - zln-xT 1 ^) 



( 1 + -Em(z) 



<-(Pr(nj)+Pr(n?)) 



E 



(-^-I^IV^W-Z^!)- 1 ,,)' 



1 - ( 1 + -Em(; 



<-(Pr(n?)+Pr(fi=)) 



+ e! ||X„| 2 - ll • h T (A(") - zln.x)- 1 ^ ■ lfi, 



IT 

4 

v- 



( 1 + -i-Em(z) 



1 



1 



< - (Pr(ng) + Pr(Q£)) + E- |X„| 2 - 1 M lo.no. + ^^(1)M 2 p 

< C(1K 9 + 0{l)p- 1/A + M'0(1) P - 1/2 + 0(l)M'V 1/a 
= o(p- 1/2+£ ), 

for any e > 0, which proves the statement. 
Lemma B.3. Notations as in Lemma B.l, 



'a„-i/a 



□ 
Proof. By

$$\Pr[|\eta_i| > \delta] = 2\int_M^{\infty}\frac{1}{\sqrt{2\pi}}e^{-u^2/2}du \le e^{-M^2/2} = p^{-10}, \qquad 1 \le i \le n-1, \tag{B.8}$$

and the union bound, we have

$$\Pr(\Omega_\delta^c) \le (n-1)\Pr[|\eta_i| > \delta] \le O(1)p^{-9}.$$
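The size of the exceptional set can be illustrated with the exact Gaussian tail. The sketch below assumes $\eta_i \sim \mathcal{N}(0, p^{-1})$, $\delta = M/\sqrt{p}$, $M = \sqrt{20\ln p}$, and compares $\Pr[|\eta_i| > \delta]$, computed with `math.erfc`, against $p^{-10}$; the choice $n = p$ is made only for illustration.

```python
import math

def tail_probability(p):
    """Pr[|eta| > delta] for eta ~ N(0, 1/p), delta = M/sqrt(p), M = sqrt(20 ln p)."""
    M = math.sqrt(20.0 * math.log(p))
    return math.erfc(M / math.sqrt(2.0))    # equals 2 * Pr[N(0, 1) > M]

for p in (10, 100, 1000):
    n = p                                    # gamma = p/n = 1, chosen for illustration
    single = tail_probability(p)
    print(p, single, p ** -10.0, (n - 1) * single)
```

The last printed column is the union bound over the $n - 1$ coordinates, which stays below $O(1)p^{-9}$.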

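Now (recall that $s(\cdot)$ denotes the magnitude of the largest singular value/spectral
norm of a matrix)

$$|r_1| = |\eta^T(\tilde{A}^{(n)} - D_\eta - zI_{n-1})^{-1}D_\eta(\tilde{A}^{(n)} - zI_{n-1})^{-1}\eta| \le s\big((\tilde{A}^{(n)} - D_\eta - zI_{n-1})^{-1}D_\eta(\tilde{A}^{(n)} - zI_{n-1})^{-1}\big)|\eta|^2.$$

Notice that on $\Omega_\delta$

$$s(D_\eta) \le \max_i\eta_i^2 \le \delta^2,$$

also $|\eta|^2 \le (n-1)\delta^2$. At the same time both $s((\tilde{A}^{(n)} - D_\eta - zI_{n-1})^{-1})$ and
$s((\tilde{A}^{(n)} - zI_{n-1})^{-1})$ are bounded by $\frac{1}{v}$, an absolute constant. Putting these together (for
Hermitian matrices $A$ and $B$, $s(AB) \le s(A)s(B)$) we have

$$|r_1|\mathbf{1}_{\Omega_\delta} \le \frac{1}{v^2}\delta^2\cdot(n-1)\delta^2 = \frac{M^4(n-1)}{v^2p^2}. \tag{B.9}$$

□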



Lemma B.4. Notations as in Sec. 2.2,

$$\mathbb{E}|r_2| \le O(1)p^{-1/2}.$$

Remark B.5. The technique is similar to the moment bound method in [4,
Chapter 3.3], where the main observation is that $\tilde{A}^{(n)}$ is independent of the
vector $\eta$.

Proof. Define $(\tilde{A}^{(n)} - zI_{n-1})^{-1}$ as $B$, which is Hermitian; we have



E|r 2 | 2 = 



/ 1\ ~ 



+ E E E VhV^XV^Bi^B^. 

By taking expectation over the $\eta_i$'s first, we see many terms vanish due to the
independence of $\eta_{i_1}$ and $\eta_{i_2}$ for $i_1 \ne i_2$, and what remains gives



E|r 2 | 2 < 



4 EW + Eis- 



Observe that 



V 

E^rTr{B T B). 

p2 



_rp n—1 



h^i 2 



< V 



1 n-1 



where $v = \Im(z) > 0$ and $\lambda_i$ are the eigenvalues of $\tilde{A}^{(n)}$. Then

2 ~ s 2 n-1 2 1 

-Tr(B B) < -5—5- < — •-, 



which means that 



so we have $\mathbb{E}|r_2| \le \sqrt{\mathbb{E}|r_2|^2} \le O(1)p^{-1/2}$. □

Lemma B.6. Notations as in Sec. 2.2,

$$\mathbb{E}|m_A(z) - \tilde{m}(z)| \to 0.$$
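Proof. First, $|m_A(z) - m_{A^{(n)}}(z)| \le O(1)\cdot n^{-1} \to 0$, due to Eq. (B.3). Second, we
show that $\mathbb{E}|m_{A^{(n)}} - m_{\tilde{A}^{(n)}}| \to 0$. By

$$\mathrm{Tr}(A^{(n)} - zI_{n-1})^{-1} - \mathrm{Tr}(\tilde{A}^{(n)} - zI_{n-1})^{-1} = \mathrm{Tr}\big(-(A^{(n)} - zI_{n-1})^{-1}\eta\eta^T(\tilde{A}^{(n)} - zI_{n-1})^{-1}\big) + \mathrm{Tr}\big((A^{(n)} - zI_{n-1})^{-1}D_\eta(\tilde{A}^{(n)} - zI_{n-1})^{-1}\big),$$

and using a similar argument as before, we can show that on $\Omega_\delta$

$$\big|\mathrm{Tr}\big((A^{(n)} - zI_{n-1})^{-1}\eta\eta^T(\tilde{A}^{(n)} - zI_{n-1})^{-1}\big)\big| \le \frac{1}{v^2}|\eta|^2 \le \frac{1}{v^2}(n-1)\delta^2 \le O(1)M^2,$$

and

$$\big|\mathrm{Tr}\big((A^{(n)} - zI_{n-1})^{-1}D_\eta(\tilde{A}^{(n)} - zI_{n-1})^{-1}\big)\big| \le (n-1)\,s\big((A^{(n)} - zI_{n-1})^{-1}D_\eta(\tilde{A}^{(n)} - zI_{n-1})^{-1}\big) \le \frac{1}{v^2}(n-1)\delta^2 = O(1)M^2.$$

As a result,

$$\mathbb{E}|m_{A^{(n)}} - m_{\tilde{A}^{(n)}}| \le \frac{2}{v}\Pr(\Omega_\delta^c) + \mathbb{E}|m_{A^{(n)}} - m_{\tilde{A}^{(n)}}|\cdot\mathbf{1}_{\Omega_\delta} \le O(1)p^{-9} + \frac{1}{n}O(1)M^2,$$

which goes to 0 as $n, p \to \infty$ with $p/n = \gamma$. □

Lemma B.7. Notations as in Sec. 2.2, on $\Omega_r \cap \Omega_\delta$,

$$\Big|\Big(1 + \frac{1}{\gamma}\mathbb{E}\tilde{m}(z)\Big)^{-1}\Big| \le 2M'.$$

Proof. On $\Omega_r \cap \Omega_\delta$, with Eq. (B.9) $|r_1| \le O(1)p^{-1/2}$, thus $|r| \le |r_1| + |r_2|$ is
bounded by $O(1)p^{-1/4}$,

$$\Big|\Big(1 + \frac{1}{\gamma}\mathbb{E}\tilde{m}(z)\Big)^{-1}\Big| = \Big|\Big(1 + \eta^T(\tilde{A}^{(n)} - D_\eta - zI_{n-1})^{-1}\eta - r - \frac{1}{\gamma}\tilde{r}\Big)^{-1}\Big| \le \frac{1}{|1 + \eta^T(\tilde{A}^{(n)} - D_\eta - zI_{n-1})^{-1}\eta| - |r| - \frac{1}{\gamma}|\tilde{r}|} \le 2M',$$

as $|1 + \eta^T(\tilde{A}^{(n)} - D_\eta - zI_{n-1})^{-1}\eta| \ge 1/M' \gg (|r| + \frac{1}{\gamma}|\tilde{r}|)$, where the latter is
bounded by $O(1)p^{-1/4}$. □

Lemma B.8. Notation as in Sec. 2.2, on $\Omega_\delta$, both $|\eta^T(\tilde{A}^{(n)} - zI_{n-1})^{-1}\eta|$ and
$|(1 + \eta^T(\tilde{A}^{(n)} - D_\eta - zI_{n-1})^{-1}\eta)^{-1}|$ are bounded by $M'$.

Proof. On $\Omega_\delta$, $|\eta^T(\tilde{A}^{(n)} - zI_{n-1})^{-1}\eta| \le s((\tilde{A}^{(n)} - zI_{n-1})^{-1})|\eta|^2 \le \frac{1}{v}\delta^2(n-1) = O(1)M^2$, and also

$$|(1 + \eta^T(\tilde{A}^{(n)} - D_\eta - zI_{n-1})^{-1}\eta)^{-1}| = |1 - \eta^T(\eta\eta^T + \tilde{A}^{(n)} - D_\eta - zI_{n-1})^{-1}\eta| \le 1 + |\eta^T(A^{(n)} - zI_{n-1})^{-1}\eta| \le 1 + O(1)M^2 := M'.$$

□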

Appendix C: Lemma in Sec. 3

Lemma C.1. Model and notations as in Sec. 3.1. Due to Eqn. (4.3), the result
in Lemma 4.1 holds.

Suppose that $k(x;p)$ is in $\mathcal{H}_{\mathcal{N}}$ and $\mathcal{H}_p$ for all $p$, and satisfies

$$\int_{\mathbb{R}}k(x;p)^2|q_p(x) - q(x)|dx \to 0, \qquad p \to \infty.$$

Let

$$b_{l,p} = \int_{\mathbb{R}}k(x;p)h_l(x)q(x)dx, \qquad a_{l,p} = \int_{\mathbb{R}}k(x;p)P_{l,p}(x)q_p(x)dx,$$

for $l = 0, 1, \cdots$. Then for each $l$, $|b_{l,p} - a_{l,p}| \to 0$ as $p \to \infty$.

Proof.

$$|b_{l,p} - a_{l,p}| = \Big|\int_{\mathbb{R}}kh_l(q - q_p)dx + \int_{\mathbb{R}}k(h_l - P_{l,p})q_pdx\Big| \le \int|kh_l||q - q_p|dx + \int|k||h_l - P_{l,p}|q_pdx := (1) + (2).$$

For (1), by Cauchy-Schwarz

$$(1)^2 \le \Big(\int k^2|q - q_p|dx\Big)\Big(\int h_l^2|q - q_p|dx\Big),$$

where

$$\int h_l^2|q - q_p|dx \le \int h_l^2q\,dx + \int h_l^2q_p\,dx = 1 + (1 + O_l(1)p^{-1}),$$

which is bounded as $p \to \infty$, and $\int k^2|q - q_p|dx \to 0$, thus $(1) \to 0$. For (2),

$$(2)^2 \le \Big(\int k^2q_pdx\Big)\Big(\int(h_l - P_{l,p})^2q_pdx\Big),$$

where $\int k^2q_pdx \to \int k^2q\,dx$ which is bounded, and by Lemma 4.1

$$h_l(x) - P_{l,p}(x) = \sum_{j=0}^{l}(\delta_{l,p})_jx^j, \qquad \max_{0\le j\le l}|(\delta_{l,p})_j| \le O_l(1)p^{-1},$$

thus

$$\Big(\int(h_l - P_{l,p})^2q_pdx\Big)^{1/2} \le O_l(1)p^{-1},$$

so $(2) \to 0$. □

Lemma C.2. Model and notations as in Sec. 3.2, and suppose that $k(x)$ is as
in Remark 3.2. Eqn. (3.4) implies that $\mathbb{E}k(\xi_p)^2 \to \mathbb{E}k(\zeta)^2 = \nu_{\mathcal{N}}$. Without loss
of generality, $k(x)$ is in $\mathcal{H}_p$ for all $p$. Define $b_{l,p}$ and $a_{l,p}$ as in Lemma C.1, and
notice that since $k(x)$ does not depend on $p$, $b_{l,p} = b_l$ independent of $p$.

Then conditions (C.Variance), (C.p-Uniform) and (C.$a_1$) are satisfied by
$k(x;p) = k(x) - a_{0,p}$. Also, $\nu_p \to \nu_{\mathcal{N}}$, and $a_{1,p} \to a_{\mathcal{N}} = b_1$.

Proof. By definition $\mathbb{E}k(\xi_p;p) = 0$. In this case,

$$\nu_p = \mathbb{E}k(\xi_p;p)^2 = \mathbb{E}k(\xi_p)^2 - a_{0,p}^2.$$

Since Lemma C.1 applies to $k(x)$, we know that

$$a_{0,p} \to b_0 = \mathbb{E}k(\zeta) = 0.$$

Together with the fact that $\mathbb{E}k(\xi_p)^2 \to \mathbb{E}k(\zeta)^2 = \nu_{\mathcal{N}}$, we know that $\nu_p \to \nu_{\mathcal{N}}$ as
$p \to \infty$. Thus (C.Variance) is satisfied.

Also $a_{1,p} \to b_1$ which is a constant, thus (C.$a_1$) holds.

For (C.p-Uniform) to be satisfied, it suffices to show that $\sum_{l=L+1}^{\infty}a_{l,p}^2$ can
be made p-uniformly small. Notice that

$$\sum_{l=0}^{\infty}a_{l,p}^2 = \mathbb{E}k(\xi_p)^2 \to \mathbb{E}k(\zeta)^2 = \sum_{l=0}^{\infty}b_l^2,$$

and meanwhile for each $l$, $a_{l,p} \to b_l$ by Lemma C.1, thus for any finite $L$

$$\sum_{l=L+1}^{\infty}a_{l,p}^2 = \mathbb{E}k(\xi_p)^2 - \sum_{l=0}^{L}a_{l,p}^2 \to \sum_{l=0}^{\infty}b_l^2 - \sum_{l=0}^{L}b_l^2 = \sum_{l=L+1}^{\infty}b_l^2,$$

which can be made small by choosing $L$ large independently of $p$. □
Lemma C.3. Notations as in Sec. 3.2. If $f(\xi;p) = f(\xi)$ is $C^1$ at $\xi = 0$, then
the theorem applies and $a^2 = \nu$. Specifically, $a = f'(0)$.

Proof. We first truncate $f(\xi)$ to be $\tilde{f}(\xi;p) = f(\xi)\mathbf{1}_{\{|\xi| \le \delta\}}$, where $\delta = \delta(p) = M/\sqrt{p}$,
$M = \sqrt{20\ln p}$. Using a similar argument as in Lemma D.2, we have

$$\Pr[\exists\, i \ne j,\ |X_i^TX_j| > \delta] \le O(1)p^{-7}.$$

Thus, if we denote by $\tilde{A}$ the random kernel matrix with kernel function $\tilde{f}$, then
for fixed $z = u + iv$

$$\mathbb{E}|m_A(z) - m_{\tilde{A}}(z)| \le \frac{2}{v}\Pr\{\exists\, i \ne j,\ |X_i^TX_j| > \delta\} \to 0,$$

where $m_A(z)$ and $m_{\tilde{A}}(z)$ are the Stieltjes transforms of $A$ and $\tilde{A}$ respectively.
Since the convergence of $\mathbb{E}m_A(z)$ implies the convergence of the spectral density,
it suffices to show that the claim in the lemma holds for $\tilde{f}(\xi;p)$.

Since $f(\xi)$ is $C^1$ at $\xi = 0$, for any $\epsilon > 0$ there exists a neighborhood $[-R, R]$
on which

$$f(\xi) = f(0) + f'(0)\xi + r(\xi), \qquad |r(\xi)| \le \epsilon|\xi|.$$

Since $\delta \to 0$ when $p \to \infty$, we assume that $p$ is large enough so that $\delta < R$.
Let $k(x;p) = \sqrt{p}\,\tilde{f}(x/\sqrt{p})$, and assume that $f(0) = 0$ since it only contributes
to $\mathbb{E}k(\xi_p;p) = a_{0,p}$; we have

$$k(x;p) = \Big(f'(0)x + \sqrt{p}\,r\Big(\frac{x}{\sqrt{p}}\Big)\Big)\mathbf{1}_{\{|x| \le M\}} := k_1 + k_2,$$

where $|k_2(x;p)| \le \epsilon|x|$, so $\mathbb{E}k_2(\xi_p;p)^2 \le \epsilon^2$. Thus, the $L^2$ norm of $k_2$ is arbitrarily
small in $\mathcal{H}_p$, and $\nu_p = \mathrm{Var}(k(\xi_p;p))$ and $a_{1,p} = \mathbb{E}\xi_p(k(\xi_p;p) - \mathbb{E}k(\xi_p;p))$ are
decided by $k_1$. For $k_1(x;p) = f'(0)x\mathbf{1}_{\{|x| \le M\}}$, $\mathbb{E}k_1(\xi_p;p) = 0$, and since $M \to \infty$
as $p \to \infty$, $\mathbb{E}k_1(\xi_p;p)^2 \to (f'(0))^2$ and $\mathbb{E}\xi_pk_1(\xi_p;p) \to f'(0)$. Thus $\nu_p \to (f'(0))^2 = \nu$,
and $a_{1,p} \to f'(0) = a$. □

Lemma C.4. Lef £ p &e as in Sec. 3.2, and equivalently £ p = p~ 1 ^ 2 ^2 v i= iXiyi 
where Xi and yi i.i.d.^ Af(0, 1). Then for p > 2, 

Pr[|&| > i?] < (2e) e " fl . 

Proof. Since for $|t| < \sqrt{p}$, $\mathbb{E}e^{t x_1 y_1/\sqrt{p}} = (1 - t^2/p)^{-1/2}$, by choosing $t = 1$ we have

\[
\Pr[\xi_p > R] \le e^{-R}\big(\mathbb{E}e^{x_1 y_1/\sqrt{p}}\big)^p
= e^{-R}\Big(1 - \frac{1}{p}\Big)^{-p/2}
\le e^{-R}e,
\]

where the last line holds because $x = 1/p$ satisfies $\log(1 - x)/x \ge -2$ when
$0 < x \le 1/2$. The argument for bounding $\Pr[\xi_p < -R]$ is similar. □
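As a sanity check on Lemma C.4, the following sketch evaluates the Chernoff factor $(1 - 1/p)^{-p/2}$ directly and compares an empirical tail frequency of $\xi_p$ with the bound $(2e)e^{-R}$. It is only an illustration: the sample size, the seed, and the function names are arbitrary choices, and a Monte Carlo estimate does not prove the inequality.

```python
import math
import random

def chernoff_factor(p):
    """(1 - 1/p)^(-p/2): the moment generating function of xi_p evaluated at t = 1."""
    return (1.0 - 1.0 / p) ** (-p / 2.0)

def tail_bound(R):
    """Right-hand side of Lemma C.4: (2e) * exp(-R)."""
    return 2.0 * math.e * math.exp(-R)

def empirical_tail(p, R, samples, seed=0):
    """Monte Carlo estimate of Pr[|xi_p| > R], xi_p = p^(-1/2) * sum_i x_i * y_i."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        xi = sum(rng.gauss(0.0, 1.0) * rng.gauss(0.0, 1.0) for _ in range(p))
        xi /= math.sqrt(p)
        if abs(xi) > R:
            hits += 1
    return hits / samples

if __name__ == "__main__":
    print(max(chernoff_factor(p) for p in range(2, 2000)), "<=", math.e)
    for R in (2.0, 4.0):
        print(R, empirical_tail(5, R, 20000), tail_bound(R))
```

The factor is largest at $p = 2$, where it equals 2, so the step $(1 - 1/p)^{-p/2} \le e$ has some slack.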
Lemma C.5. Notations as in Sec. 3.2. Suppose $k(x)$ is (Case 1) bounded, or
(Case 2) in $\mathcal{H}_N$ and $\mathcal{H}_p$ for all $p$, is bounded on $|x| \le R$ for any $R > 0$, and
satisfies

\[ \int_{|x|>R} k(x)^2 q_p(x)\,dx \to 0, \qquad R \to \infty \]

uniformly in $p$, then Eqn. (3.4) holds.

Proof. First, we reduce (Case 2) to (Case 1). Notice that

\[
\int_{\mathbb{R}} k(x)^2 |q_p(x) - q(x)|\,dx
\le \int_{|x|\le R} k(x)^2 |q_p(x) - q(x)|\,dx + \int_{|x|>R} k(x)^2 q_p(x)\,dx + \int_{|x|>R} k(x)^2 q(x)\,dx.
\]

The last two terms can be made arbitrarily small independently of $p$ by choosing
$R$ large, and for fixed $R$, the first term goes to 0 given that (Case 1) is proved.

To show the claim for (Case 1), it suffices to show that $\int |q_p - q|\,dx \to 0$.
Since $\xi_p$ converges in distribution to $\mathcal{N}(0, 1)$, we know that for any finite $R$,
$\int_{|x|\le R} |q_p(x) - q(x)|\,dx \to 0$. Thus, it suffices to show that

\[ \int_{|x|>R} q_p(x)\,dx \to 0, \qquad R \to \infty \]

uniformly in $p$. This follows from the large deviation bound that is given in
Lemma C.4. □



Appendix D: Lemma in Sec. 4 

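Lemma D.1. Let $X_1, X_2, X_3, X_4$ be i.i.d. distributed as $\mathcal{N}(0, p^{-1}I_p)$, and let $P_{l,p}(x)$
be the degree-$l$ Hermite-like polynomial as defined in Sec. 4.1, $l \ge 2$. Then

\[ \big|\mathbb{E}\,P_{l,p}(\sqrt{p}\xi_{12})P_{l,p}(\sqrt{p}\xi_{23})P_{l,p}(\sqrt{p}\xi_{34})P_{l,p}(\sqrt{p}\xi_{41})\big| \le O_l(1)p^{-2}, \]

where $\xi_{ij} = X_i^T X_j$.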
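Proof. Write

\[ \xi_{12} = |X_1|\eta_2, \quad \xi_{14} = |X_1|\eta_4, \quad \xi_{23} = \eta_2\eta_3 + \tilde\xi_{23}, \quad \xi_{34} = \eta_3\eta_4 + \tilde\xi_{34}, \]

where $\eta_2, \eta_3, \eta_4$ are i.i.d. distributed as $\mathcal{N}(0, p^{-1})$, and $|X_1|$, the $\eta_i$'s and $\tilde\xi_{23}, \tilde\xi_{34}$ are
jointly independent. Since $P_{l,p}(x)$ is a polynomial of degree $l$,

\[
P_{l,p}(x_1 + x_2) = P_{l,p}(x_2) + x_1 P'_{l,p}(x_2) + P_{(2)}(x_1, x_2), \qquad
P_{(2)}(x_1, x_2) = \sum_{k=2}^{l} \frac{x_1^k}{k!} P^{(k)}_{l,p}(x_2),
\]

thus the product of the four polynomials becomes

\[
\begin{aligned}
& P_{l,p}(\sqrt{p}|X_1|\eta_2)\, P_{l,p}(\sqrt{p}|X_1|\eta_4) \\
& \cdot \big[ P_{l,p}(\sqrt{p}\tilde\xi_{23}) + \sqrt{p}\eta_2\eta_3 P'_{l,p}(\sqrt{p}\tilde\xi_{23}) + P_{(2)}(\sqrt{p}\eta_2\eta_3, \sqrt{p}\tilde\xi_{23}) \big] \\
& \cdot \big[ P_{l,p}(\sqrt{p}\tilde\xi_{34}) + \sqrt{p}\eta_3\eta_4 P'_{l,p}(\sqrt{p}\tilde\xi_{34}) + P_{(2)}(\sqrt{p}\eta_3\eta_4, \sqrt{p}\tilde\xi_{34}) \big].
\end{aligned}
\]

Notice that $\mathbb{E}\eta_3 = 0$, thus it suffices to show that

\[
\begin{aligned}
& \mathbb{E}\, P_{l,p}(\sqrt{p}|X_1|\eta_2) P_{l,p}(\sqrt{p}|X_1|\eta_4) P_{l,p}(\sqrt{p}\tilde\xi_{23}) P_{l,p}(\sqrt{p}\tilde\xi_{34}), \\
& \mathbb{E}\, \sqrt{p}\eta_2 P_{l,p}(\sqrt{p}\eta_2|X_1|) \cdot \sqrt{p}\eta_4 P_{l,p}(\sqrt{p}|X_1|\eta_4) \cdot \eta_3^2 \cdot P'_{l,p}(\sqrt{p}\tilde\xi_{23}) P'_{l,p}(\sqrt{p}\tilde\xi_{34}), \\
& \mathbb{E}\, \sqrt{p}\eta_2 P_{l,p}(\sqrt{p}\eta_2|X_1|) \cdot P'_{l,p}(\sqrt{p}\tilde\xi_{23}) \cdot P_{l,p}(\sqrt{p}|X_1|\eta_4)\, \eta_3 P_{(2)}(\sqrt{p}\eta_3\eta_4, \sqrt{p}\tilde\xi_{34}), \\
& \mathbb{E}\, P_{l,p}(\sqrt{p}|X_1|\eta_2) P_{l,p}(\sqrt{p}|X_1|\eta_4) P_{(2)}(\sqrt{p}\eta_2\eta_3, \sqrt{p}\tilde\xi_{23}) P_{(2)}(\sqrt{p}\eta_3\eta_4, \sqrt{p}\tilde\xi_{34})
\end{aligned}
\]

are all bounded by $O_l(1)p^{-2}$. This can be done by making use of the facts
that $\mathbb{E}|X_1|^{2m} = 1 + O_m(1)p^{-1}$ (Eq. (D.1)), and of the bound on the differences between
the coefficients of $P_{l,p}(x)$ and those of $h_l(x)$ (Lemma 4.1). The
condition $l \ge 2$ is needed to guarantee that $P_{l,p}(x)$ and $x$ are asymptotically
orthogonal. □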
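The splitting $P(x_1 + x_2) = P(x_2) + x_1 P'(x_2) + P_{(2)}(x_1, x_2)$ used in the proof above is just the exact Taylor expansion of a polynomial. A minimal sketch follows, using the unnormalized Hermite polynomial $He_3(x) = x^3 - 3x$ as a stand-in for $P_{l,p}$ (an assumption, since the coefficients of $P_{l,p}$ are not reproduced here); the helper names are illustrative.

```python
from math import factorial

def evaluate(coeffs, x):
    """Evaluate sum_k coeffs[k] * x**k by Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result

def derivative(coeffs, order=1):
    """Coefficient list of the order-th derivative of the polynomial."""
    for _ in range(order):
        coeffs = [k * c for k, c in enumerate(coeffs)][1:]
    return coeffs

def taylor_split(coeffs, x1, x2):
    """Return (P(x2), x1 * P'(x2), P_(2)(x1, x2)) for the polynomial P given by coeffs."""
    degree = len(coeffs) - 1
    zeroth = evaluate(coeffs, x2)
    first = x1 * evaluate(derivative(coeffs), x2)
    remainder = sum(
        x1 ** k / factorial(k) * evaluate(derivative(coeffs, k), x2)
        for k in range(2, degree + 1)
    )
    return zeroth, first, remainder

if __name__ == "__main__":
    he3 = [0.0, -3.0, 0.0, 1.0]  # He_3(x) = x^3 - 3x, coefficients in increasing degree
    parts = taylor_split(he3, 0.3, 1.2)
    print(parts, sum(parts), evaluate(he3, 1.5))
```

The three parts sum to $P(x_1 + x_2)$ exactly (up to rounding), and the remainder $P_{(2)}$ carries at least two powers of $x_1$, which is what produces the extra factors of $\eta$ in the expectations above.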

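Proof of Lemma 4.4. We have

\[
m_A(z) - m_B(z)
= \frac{1}{n}\big(\operatorname{Tr}((A - zI)^{-1}) - \operatorname{Tr}((B - zI)^{-1})\big)
= \frac{1}{n}\operatorname{Tr}\big((A - zI)^{-1}(B - A)(B - zI)^{-1}\big),
\]

thus, with $v = \Im z$,

\[
\begin{aligned}
\mathbb{E}|m_A(z) - m_B(z)|^2
&= \mathbb{E}\frac{1}{n^2}\big|\operatorname{Tr}\big((A - zI)^{-1}(B - A)(B - zI)^{-1}\big)\big|^2 \\
&\le \mathbb{E}\frac{1}{n^2}\operatorname{Tr}\big(((B - zI)^{-1}(A - zI)^{-1})^2\big)\operatorname{Tr}\big((B - A)^2\big) \\
&\le \frac{1}{n v^4}\,\mathbb{E}\sum_{i \neq j}\big(f_A(X_i^T X_j; p) - f_B(X_i^T X_j; p)\big)^2 \\
&\le \frac{1}{n v^4}\, n^2 p^{-1}\epsilon = O(1)\epsilon. \qquad □
\end{aligned}
\]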
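The first display in the proof of Lemma 4.4 rests on the resolvent identity $(A - zI)^{-1} - (B - zI)^{-1} = (A - zI)^{-1}(B - A)(B - zI)^{-1}$. The sketch below checks the resulting expression for $m_A(z) - m_B(z)$ on a pair of $2 \times 2$ symmetric matrices; the matrices and the point $z$ are arbitrary illustrative values, and the helpers are written out by hand to avoid any external dependency.

```python
def inv2(m):
    """Inverse of a 2x2 (possibly complex) matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul2(x, y):
    """Product of two 2x2 matrices."""
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def shift(m, z):
    """Return m - z * I."""
    return [[m[i][j] - (z if i == j else 0) for j in range(2)] for i in range(2)]

def trace(m):
    return m[0][0] + m[1][1]

def stieltjes_difference(A, B, z):
    """Return (m_A(z) - m_B(z), (1/n) Tr((A - zI)^{-1} (B - A) (B - zI)^{-1})) for n = 2."""
    n = 2
    res_a, res_b = inv2(shift(A, z)), inv2(shift(B, z))
    lhs = (trace(res_a) - trace(res_b)) / n
    diff = [[B[i][j] - A[i][j] for j in range(2)] for i in range(2)]
    rhs = trace(matmul2(matmul2(res_a, diff), res_b)) / n
    return lhs, rhs

if __name__ == "__main__":
    A = [[1.0, 0.5], [0.5, -1.0]]
    B = [[0.8, 0.4], [0.4, -1.1]]
    print(stieltjes_difference(A, B, 0.3 + 0.7j))
```

The two returned complex numbers agree to rounding error, which is the identity the subsequent Cauchy-Schwarz step starts from.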

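Lemma D.2. Let $\Omega_\delta$ be defined as in Eq. (4.12). Then

\[ \Pr(\Omega_\delta^c) \le O(1)p^{-7}. \]

Proof. For $\eta_i$ we have the concentration inequality Eq. (B.8). For each $\xi_{ij}$ we
write it as

\[ \xi_{ij} = |X_i|\tilde\eta_{ij}, \]

where $\tilde\eta_{ij}$ has marginal distribution $\mathcal{N}(0, p^{-1})$ and is independent of $|X_i|$. With
inequality Eq. (B.4), which also holds for $|X_i|$ in place of $|X_n|$, we have

\[
\begin{aligned}
\Pr[|X_i||\tilde\eta_{ij}| > \delta]
&\le \Pr\Big[|X_i|^2 > 1 + \sqrt{\tfrac{40 \ln p}{p}}\Big]
 + \Pr\Big[|X_i||\tilde\eta_{ij}| > \delta,\ |X_i|^2 \le 1 + \sqrt{\tfrac{40 \ln p}{p}}\Big] \\
&\le p^{-9} + \Pr\Big[|\tilde\eta_{ij}| > \delta\Big(1 + \sqrt{\tfrac{40 \ln p}{p}}\Big)^{-1/2}\Big],
\end{aligned}
\]

thus

\[ \Pr[|\xi_{ij}| > \delta] \le O(1)p^{-9}. \]

Then, a union bound gives

\[
\begin{aligned}
\Pr(\Omega_\delta^c) &\le (n-1)\Pr[|\eta_i| > \delta] + (n-1)(n-2)\Pr[|\xi_{ij}| > \delta]
 + \Pr\Big[\big||X_n|^2 - 1\big| > \sqrt{\tfrac{40 \ln p}{p}}\Big] \\
&\le O(1)p^{-9} + O(1)p^{-7} + p^{-9} = O(1)p^{-7}. \qquad □
\end{aligned}
\]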
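To see where the powers of $p$ in the union bound come from, one can evaluate the Gaussian tail exactly. The sketch below assumes the truncation level $\delta = M/\sqrt{p}$ with $M = \sqrt{20 \ln p}$ (as the proof of Lemma C.3 appears to define it) and $\eta \sim \mathcal{N}(0, p^{-1})$; under that assumption $\Pr[|\eta| > \delta] = \operatorname{erfc}(\sqrt{10 \ln p}) \le p^{-10}$, so multiplying by $n - 1 = O(p)$ leaves $O(1)p^{-9}$. The function names are illustrative.

```python
import math

def gaussian_tail(p):
    """Pr[|eta| > delta] for eta ~ N(0, 1/p) and delta = sqrt(20 ln p / p) (assumed level)."""
    M = math.sqrt(20.0 * math.log(p))  # M = sqrt(p) * delta
    # For Z ~ N(0, 1): Pr[|Z| > M] = erfc(M / sqrt(2)).
    return math.erfc(M / math.sqrt(2.0))

def union_term(n, p):
    """First term of the union bound: (n - 1) * Pr[|eta| > delta]."""
    return (n - 1) * gaussian_tail(p)

if __name__ == "__main__":
    for p in (10, 100, 1000):
        print(p, gaussian_tail(p), p ** -10, union_term(p, p))
```

The inequality $\operatorname{erfc}(x) \le e^{-x^2}$ for $x \ge 0$ gives the $p^{-10}$ bound directly; the printed values show it holds with room to spare.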

Lemma D.3. Notation as in Sec. 4.3. $r_2$ defined in Eq. (4.20) satisfies

\[ \mathbb{E}|r_2| \cdot \mathbf{1}_{\Omega_\delta} \le O_L(1)M^2 p^{-1/2}. \]
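Proof. From Eq. (4.21), firstly,

\[ r_{2,1} = f_{(2)}^T(\tilde A^{(n)} - zI_{n-1})^{-1}(|X_n|\eta) \]

satisfies $\mathbb{E}|r_{2,1}| \le O_L(1)p^{-1/2}$ by a moment bound: recall the definition of $\xi_{in}$
as in Eq. (2.12), and that $f_{>1}(\xi)$ is a linear combination of rescaled and
renormalized Hermite-like polynomials of degree $\ge 2$. Also, $\mathbb{E}|X_n|^{2m} = 1 + O_m(1)p^{-1}$
(Eq. (D.1)), and $|X_n|$ is independent from the $\eta_i$'s and $\tilde A^{(n)}$. Denote $B = (\tilde A^{(n)} -
zI_{n-1})^{-1}$. By taking expectation over $|X_n|$ first and then over the $\eta_i$'s, we have

\[
\begin{aligned}
\mathbb{E}|r_{2,1}|^2
&= \mathbb{E}\Big|\sum_{i_1, i_2 = 1}^{n-1} f_{>1}(\xi_{i_1 n})\,\xi_{i_2 n}\,B_{i_1 i_2}\Big|^2 \\
&= \mathbb{E}\sum_{i_1, i_2, i_1', i_2'} f_{>1}(\xi_{i_1 n})\,\xi_{i_2 n}\,f_{>1}(\xi_{i_1' n})\,\xi_{i_2' n}\,B_{i_1 i_2}\bar B_{i_1' i_2'} \\
&= \{i_1 = i_2 = i_1' = i_2'\} + \{i_1,\ i_2 = i_1' = i_2',\ \text{or } i_1 \text{ as } i_1'\} \\
&\quad + \{i_2,\ i_1 = i_1' = i_2',\ \text{or } i_2 \text{ as } i_2'\} + \{i_1 = i_2,\ i_1' = i_2',\ \text{or } i_1 \text{ as } i_1'\} + \{i_1 = i_1',\ i_2 = i_2'\} \\
&= O_L(1)p^{-1} + \nu_{>1,p}\,p^{-2}\,\mathbb{E}\operatorname{Tr}(B^T B) \\
&\le O_L(1)p^{-1} + O(1)\cdot p^{-2}\cdot\frac{n}{v^2} = O_L(1)p^{-1},
\end{aligned}
\]

where by $\{i_1,\ i_2 = i_1' = i_2'\}$ we denote the terms in the summation where the last
three indices take the same value while $i_1$ is distinct from them, and similarly for
the others.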
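Secondly,

\[
r_{2,2} = \big(f_{(2)}^T(\tilde A^{(n)} - zI_{n-1})^{-1}(|X_n|\eta)\big)\big(a_1(p)\,\eta^T(\tilde A^{(n)} - zI_{n-1})^{-1}\eta\big)
= r_{2,1}\big(a_1(p)\,\eta^T(\tilde A^{(n)} - zI_{n-1})^{-1}\eta\big),
\]

where

\[
\big|a_1(p)\,\eta^T(\tilde A^{(n)} - zI_{n-1})^{-1}\eta\big| \cdot \mathbf{1}_{\Omega_\delta}
\le O(1)\big\|(\tilde A^{(n)} - zI_{n-1})^{-1}\big\|\,|\eta|^2 \cdot \mathbf{1}_{\Omega_\delta}
\le O(1)\,n\delta^2 = O(1)M^2,
\]

thus

\[ \mathbb{E}|r_{2,2}| \cdot \mathbf{1}_{\Omega_\delta} \le O(1)M^2\,\mathbb{E}|r_{2,1}| \le O(1)M^2 \cdot O_L(1)p^{-1/2} = O_L(1)M^2 p^{-1/2}. \]

Then

\[ \mathbb{E}|r_2| \cdot \mathbf{1}_{\Omega_\delta} \le O(1)\big(\mathbb{E}|r_{2,1}| + \mathbb{E}|r_{2,2}| \cdot \mathbf{1}_{\Omega_\delta}\big) \le O_L(1)M^2 p^{-1/2}. \qquad □ \]

Lemma D.4. Notations as in Sec. 4.4;

\[
\mathbb{E}(\xi'_p)^k =
\begin{cases}
(k-1)!! + O_k(1)p^{-1}, & k \text{ even}; \\
0, & k \text{ odd}.
\end{cases}
\]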



Proof. The odd moments vanish since the distribution of $\xi'_p$ is symmetric with
respect to 0. For even moments, let $k = 2m$. Let $\xi_p = \sqrt{p}X^T Y$ where $X$ and
$Y$ are i.i.d. $\mathcal{N}(0, p^{-1}I_p)$, and we have that $\xi_p$ and $\xi'_p|X||Y|$ have the same
probability distribution. Notice that $\xi'_p$, $|X|$ and $|Y|$ are independent, so

\[ \mathbb{E}\xi_p^{2m} = \mathbb{E}|X|^{2m}\,\mathbb{E}|Y|^{2m}\,\mathbb{E}(\xi'_p)^{2m} = (\mathbb{E}|X|^{2m})^2\,\mathbb{E}(\xi'_p)^{2m}. \]
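By Eq. (4.3), to show the claim it suffices to show that $\mathbb{E}|Y|^{2m} = 1 + O_m(1)p^{-1}$.
To do this, define

\[ r = |X|^2 - 1 = \sum_{j=1}^{p}\big(X_j^2 - p^{-1}\big), \]

where $X_j$ denotes the $j$-th coordinate of $X$.
Due to the mutual independence of the $X_j$'s, the odd moments of $r$ vanish;
$\mathbb{E}r^2 = 2p^{-1}$, and generally for even $l$

\[ \mathbb{E}\big(\sqrt{p/2}\,r\big)^l = (l-1)!! + O_l(1)p^{-1}, \]

so $\mathbb{E}r^l = O_l(1)p^{-l/2}$. Then

\[
\begin{aligned}
\mathbb{E}|X|^{2m} &= \mathbb{E}(1 + r)^m \\
&= 1 + \sum_{l=2,\ l \text{ even}}^{m} c(l,m)\,\mathbb{E}r^l \\
&= 1 + \sum_{l=2,\ l \text{ even}}^{m} c(l,m)\,O_l(1)p^{-l/2} \\
&= 1 + O_m(1)p^{-1}. \qquad □ \qquad \text{(D.1)}
\end{aligned}
\]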
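Eq. (D.1) can also be checked against the exact moments of a chi-square variable. For $X \sim \mathcal{N}(0, p^{-1}I_p)$, $p|X|^2$ is $\chi^2_p$-distributed, and the standard formula $\mathbb{E}(\chi^2_p)^m = p(p+2)\cdots(p+2m-2)$ gives $\mathbb{E}|X|^{2m} = \prod_{j=0}^{m-1}(1 + 2j/p)$. The sketch below evaluates this product in exact rational arithmetic; it is offered as an independent cross-check rather than as the paper's derivation, and the function name is illustrative.

```python
from fractions import Fraction

def norm_moment(p, m):
    """Exact E|X|^(2m) for X ~ N(0, I_p / p): the product of (1 + 2j/p) over j = 0..m-1."""
    value = Fraction(1)
    for j in range(m):
        value *= Fraction(p + 2 * j, p)
    return value

if __name__ == "__main__":
    for m in (1, 2, 3, 4):
        for p in (10, 1000, 10**6):
            # p * (E|X|^(2m) - 1) stays bounded, approaching m * (m - 1) as p grows.
            print(m, p, float(p * (norm_moment(p, m) - 1)))
```

The scaled deviation $p(\mathbb{E}|X|^{2m} - 1)$ converges to $m(m-1)$, which makes the $O_m(1)p^{-1}$ term in (D.1) explicit.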